Sample-based XPath Ranking for Web Information Extraction

Oliver Jundt, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

Abstract

Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
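The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' scoring method, which the abstract does not specify: here a candidate path is simply scored by the fraction of sample pages on which it extracts the known sample value. The function name `rank_xpaths`, the toy pages, and the candidate paths are all hypothetical, and the sketch uses the limited XPath subset of Python's standard-library `xml.etree.ElementTree`, so the pages must be well-formed XML.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(pages, samples, candidates):
    """Rank candidate paths by how often they extract the expected sample value.

    pages      -- list of well-formed (X)HTML strings, one detail page per object
    samples    -- list of known attribute values, aligned with `pages`
    candidates -- list of path expressions in ElementTree's XPath subset
    """
    roots = [ET.fromstring(page) for page in pages]  # parse each page once
    scores = {}
    for path in candidates:
        hits = 0
        for root, expected in zip(roots, samples):
            # Collect the trimmed text of every element the path selects.
            values = [(el.text or "").strip() for el in root.findall(path)]
            if expected in values:
                hits += 1
        scores[path] = hits / len(samples)
    # Highest score first; sorted() is stable, so ties keep candidate order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical detail pages and sample data for a "price" attribute.
pages = [
    "<html><body><h1>Camera A</h1><div class='price'>199</div>"
    "<div class='sku'>X1</div></body></html>",
    "<html><body><h1>Camera B</h1><div class='price'>249</div>"
    "<div class='sku'>X2</div></body></html>",
]
samples = ["199", "249"]
candidates = [".//div[@class='price']", ".//div[@class='sku']", ".//h1"]

ranking = rank_xpaths(pages, samples, candidates)
for path, score in ranking:
    print(f"{score:.2f}  {path}")
```

This score only measures whether the sample value is among the extracted values. A realistic ranking would presumably also penalize paths that return many unrelated values, since an overly general path such as `.//div` would otherwise score as well as a precise one.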
Original language: Undefined
Title of host publication: Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013)
Place of Publication: Amsterdam
Publisher: Atlantis Press
Pages: 39
Number of pages: 8
ISBN (Print): 978-90786-77-78-9
DOI: https://doi.org/10.2991/eusflat.2013.27
Publication status: Published - Sep 2013

Publication series

Name: Advances in Intelligent Systems Research
Publisher: Atlantis Press
Volume: 32
ISSN (Print): 1951-6851

Keywords

  • EWI-23413
  • IR-86350
  • METIS-297686

Cite this

Jundt, O., & van Keulen, M. (2013). Sample-based XPath Ranking for Web Information Extraction. In Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013) (pp. 39). (Advances in Intelligent Systems Research; Vol. 32). Amsterdam: Atlantis Press. https://doi.org/10.2991/eusflat.2013.27