Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be infeasible, as some applications require information extraction from previously unseen websites. This paper addresses the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The proposed wrapper induction approach uses a small and easily obtainable set of sample data to rank XPaths by their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
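To make the idea of sample-based XPath ranking concrete, the following is a minimal sketch and not the ranking method of the paper. It assumes that each sample is a pair of a well-formed page and the attribute value expected on it, and it scores each candidate XPath by the fraction of samples for which it extracts that value. The function name `rank_xpaths`, the toy pages, and the candidate XPaths are hypothetical; the standard-library `xml.etree.ElementTree` module is used, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def rank_xpaths(samples, candidates):
    """Rank candidate XPaths by the fraction of samples whose expected value they extract.

    Illustrative hit-rate scoring only; the scoring used in the paper is not reproduced here.
    """
    scores = []
    for xpath in candidates:
        hits = 0
        for page, expected in samples:
            root = ET.fromstring(page)  # assumes the page is well-formed XML
            texts = [(el.text or "").strip() for el in root.findall(xpath)]
            if expected in texts:
                hits += 1
        scores.append((xpath, hits / len(samples)))
    # Highest hit rate first; ties keep their original candidate order.
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical detail pages paired with the known value of a "price" attribute.
samples = [
    ("<html><body><div class='t'>Alpha</div><span class='p'>10</span></body></html>", "10"),
    ("<html><body><div class='t'>Beta</div><span class='p'>20</span></body></html>", "20"),
]
candidates = [".//div[@class='t']", ".//span[@class='p']", ".//span"]

ranking = rank_xpaths(samples, candidates)
for xpath, score in ranking:
    print(f"{score:.2f}  {xpath}")
```

In this toy setting both `span` expressions extract the expected value from every sample, while the `div` expression extracts it from none. A realistic ranking would need further criteria to separate such ties, for example a preference for more specific expressions.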
|Title of host publication|Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013)|
|Place of publication|Amsterdam|
|Number of pages|8|
|Publication status|Published - Sep 2013|
|Series name|Advances in Intelligent Systems Research|
Jundt, O., & van Keulen, M. (2013). Sample-based XPath Ranking for Web Information Extraction. In Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013) (pp. 39). (Advances in Intelligent Systems Research; Vol. 32). Amsterdam: Atlantis Press. https://doi.org/10.2991/eusflat.2013.27