Sample-based XPath Ranking for Web Information Extraction

Oliver Jundt, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review



Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The proposed wrapper induction approach uses a small, easily obtainable set of sample data to rank XPaths by their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
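The general idea of ranking XPaths against sample data can be illustrated with a minimal sketch. This is not the paper's ranking function: the scoring rule (the average share of extracted values that match a known sample), the function name `rank_xpaths`, and the toy pages are all assumptions made for illustration. It also assumes pages that parse as well-formed XML, which real HTML often does not, and it relies on the limited XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, candidates, samples):
    """Rank candidate XPaths by how well their output matches known samples.

    Hypothetical scoring: for each page, take the fraction of extracted
    text values that appear in `samples`, then average over all pages.
    """
    roots = [ET.fromstring(page) for page in pages]
    scores = {}
    for xpath in candidates:
        page_scores = []
        for root in roots:
            texts = [(el.text or "").strip() for el in root.findall(xpath)]
            if texts:
                matched = sum(1 for text in texts if text in samples)
                page_scores.append(matched / len(texts))
            else:
                # An XPath that extracts nothing from a page earns no credit.
                page_scores.append(0.0)
        scores[xpath] = sum(page_scores) / len(page_scores)
    # Highest-scoring XPath first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy detail pages (assumed data, not taken from the paper).
PAGES = [
    "<html><body><h1>Alpha</h1><div class='price'>10</div>"
    "<div class='id'>7</div></body></html>",
    "<html><body><h1>Beta</h1><div class='price'>20</div>"
    "<div class='id'>10</div></body></html>",
]
CANDIDATES = [".//h1", ".//div[@class='price']", ".//div[@class='id']", ".//div"]
SAMPLES = {"10", "20"}  # known values of the wanted attribute

ranking = rank_xpaths(PAGES, CANDIDATES, SAMPLES)
for xpath, score in ranking:
    print(f"{score:.2f}  {xpath}")
```

In this toy setup the specific XPath for the price element ranks first, while broader or unrelated XPaths score lower because they also extract values that are not among the samples.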
Original language: Undefined
Title of host publication: Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013)
Place of Publication: Amsterdam
Publisher: Atlantis Press
Number of pages: 8
ISBN (Print): 978-90786-77-78-9
Publication status: Published - Sep 2013

Publication series

Name: Advances in Intelligent Systems Research
Publisher: Atlantis Press
ISSN (Print): 1951-6851


  • EWI-23413
  • IR-86350
  • METIS-297686

Cite this

Jundt, O., & van Keulen, M. (2013). Sample-based XPath Ranking for Web Information Extraction. In Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013) (pp. 39). (Advances in Intelligent Systems Research; Vol. 32). Amsterdam: Atlantis Press.