Sample-based XPath Ranking for Web Information Extraction

Oliver Jundt, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review

1 Citation (Scopus)
289 Downloads (Pure)

Abstract

Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive, as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The proposed wrapper induction approach uses a small and easily obtainable set of sample data to rank XPaths by their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
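
The idea of ranking XPaths against sample data can be illustrated with a minimal sketch. It is not the scoring method of the paper, which the abstract does not specify. It assumes a hypothetical score in which an XPath earns credit for a sample when it extracts the expected value from that sample's detail page, and the credit is divided by the number of distinct values the XPath extracts there, so overly broad XPaths are penalized. The function names, toy pages, and candidate XPaths are all illustrative, and the limited XPath subset of Python's standard library stands in for a full XPath engine.

```python
import xml.etree.ElementTree as ET


def score_xpath(xpath, pages, samples):
    """Average per-sample credit for one XPath (hypothetical scoring rule).

    pages maps a page id to a parsed root element; samples is a list of
    (page id, expected value) pairs. A sample earns 1 / (number of distinct
    values extracted) if the expected value is among them, otherwise 0.
    """
    if not samples:
        return 0.0
    total = 0.0
    for page_id, expected in samples:
        nodes = pages[page_id].findall(xpath)
        extracted = {(node.text or "").strip() for node in nodes}
        if expected in extracted:
            total += 1.0 / len(extracted)
    return total / len(samples)


def rank_xpaths(candidates, pages, samples):
    """Return (xpath, score) pairs sorted from best to worst score."""
    scored = [(xp, score_xpath(xp, pages, samples)) for xp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy detail pages; well-formed markup so the standard-library parser accepts it.
PAGES = {
    "p1": ET.fromstring(
        "<html><body><h1>Alpha</h1>"
        "<div class='price'>10</div><div class='sku'>A1</div></body></html>"
    ),
    "p2": ET.fromstring(
        "<html><body><h1>Beta</h1>"
        "<div class='price'>25</div><div class='sku'>B2</div></body></html>"
    ),
}

# Sample data: the known value of the wanted attribute for each object.
SAMPLES = [("p1", "10"), ("p2", "25")]

# Candidate XPaths, here written by hand rather than generated automatically.
CANDIDATES = [".//h1", ".//div", ".//div[@class='price']", ".//div[@class='sku']"]

RANKING = rank_xpaths(CANDIDATES, PAGES, SAMPLES)
```

In this toy setup the XPath `.//div[@class='price']` ranks first, the broader `.//div` ranks second because it also extracts the unwanted `sku` values, and the two remaining candidates score zero.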
Original language: Undefined
Title of host publication: Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013)
Place of Publication: Amsterdam
Publisher: Atlantis Press
Pages: 39
Number of pages: 8
ISBN (Print): 978-90786-77-78-9
Publication status: Published - Sept 2013
Event: 8th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2013 - Milano, Italy
Duration: 11 Sept 2013 to 13 Sept 2013

Publication series

Name: Advances in Intelligent Systems Research
Publisher: Atlantis Press
Volume: 32
ISSN (Print): 1951-6851

Conference

Conference: 8th Conference of the European Society for Fuzzy Logic and Technology, EUSFLAT 2013
Period: 11/09/13 to 13/09/13
Other: 11-13 September 2013

Keywords

  • EWI-23413
  • IR-86350
  • METIS-297686
