Abstract
Original language | Undefined |
---|---|
Title of host publication | Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013) |
Place of Publication | Amsterdam |
Publisher | Atlantis Press |
Pages | 39 |
Number of pages | 8 |
ISBN (Print) | 978-90786-77-78-9 |
DOIs | 10.2991/eusflat.2013.27 |
Publication status | Published - Sep 2013 |
Publication series
Name | Advances in Intelligent Systems Research |
---|---|
Publisher | Atlantis Press |
Volume | 32 |
ISSN (Print) | 1951-6851 |
Keywords
- EWI-23413
- IR-86350
- METIS-297686
Cite this
Sample-based XPath Ranking for Web Information Extraction. / Jundt, Oliver; van Keulen, Maurice.
Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013). Amsterdam : Atlantis Press, 2013. p. 39 (Advances in Intelligent Systems Research; Vol. 32).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review
TY - GEN
T1 - Sample-based XPath Ranking for Web Information Extraction
AU - Jundt, Oliver
AU - van Keulen, Maurice
N1 - 10.2991/eusflat.2013.27
PY - 2013/9
Y1 - 2013/9
N2 - Web information extraction typically relies on a wrapper, i.e., program code or a configuration that specifies how to extract some information from web pages at a specific website. Manually creating and maintaining wrappers is a cumbersome and error-prone task. It may even be prohibitive as some applications require information extraction from previously unseen websites. This paper approaches the problem of automatic on-the-fly wrapper creation for websites that provide attribute data for objects in a ‘search – search result page – detail page’ setup. The approach is a wrapper induction approach which uses a small and easily obtainable set of sample data for ranking XPaths on their suitability for extracting the wanted attribute data. Experiments show that the automatically generated top-ranked XPaths indeed extract the wanted data. Moreover, it appears that 20 to 25 input samples suffice for finding a suitable XPath for an attribute.
KW - EWI-23413
KW - IR-86350
KW - METIS-297686
U2 - 10.2991/eusflat.2013.27
DO - 10.2991/eusflat.2013.27
M3 - Conference contribution
SN - 978-90786-77-78-9
T3 - Advances in Intelligent Systems Research
SP - 39
BT - Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013)
PB - Atlantis Press
CY - Amsterdam
ER -
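The abstract describes ranking candidate XPaths by how well they extract a wanted attribute, using a small set of sample values. The following is a minimal sketch of that general idea, not the authors' method: the scoring function (the share of extracted values that appear in the sample set), the names `score_xpath` and `rank_xpaths`, and the toy pages are all assumptions made for illustration. It uses only the limited XPath subset supported by Python's standard `xml.etree.ElementTree`, so the pages must be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical detail pages (well-formed markup, invented for this sketch).
PAGES = [
    "<html><body><h1>Alpha Hotel</h1><div class='info'>"
    "<span>Enschede</span><span>4 stars</span></div></body></html>",
    "<html><body><h1>Beta Inn</h1><div class='info'>"
    "<span>Amsterdam</span><span>3 stars</span></div></body></html>",
]

# Hypothetical sample values for the wanted attribute (here: a city).
SAMPLES = {"Enschede", "Amsterdam"}

# Hypothetical candidate XPaths to be ranked.
CANDIDATES = [
    ".//h1",
    ".//div[@class='info']/span",
    ".//div[@class='info']/span[1]",
]


def score_xpath(xpath, pages, samples):
    """Return the share of values extracted by `xpath` that are sample values.

    This precision-style score is an assumption of this sketch; it is 0.0
    when the XPath extracts nothing at all.
    """
    extracted = 0
    matched = 0
    for page in pages:
        root = ET.fromstring(page)
        for element in root.findall(xpath):
            value = (element.text or "").strip()
            extracted += 1
            if value in samples:
                matched += 1
    return matched / extracted if extracted else 0.0


def rank_xpaths(candidates, pages, samples):
    """Return (score, xpath) pairs sorted from best to worst score."""
    scored = [(score_xpath(xp, pages, samples), xp) for xp in candidates]
    return sorted(scored, key=lambda pair: -pair[0])


ranked = rank_xpaths(CANDIDATES, PAGES, SAMPLES)
for score, xpath in ranked:
    print(f"{score:.2f}  {xpath}")
```

In this toy setup the most specific XPath ranks first because every value it extracts is a known sample, while the broader `span` path also picks up unrelated values. A realistic scorer would likely need fuzzy matching and a full XPath engine, neither of which is shown here.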