Ranking XPaths for extracting search result records

Rudolf Berend Trieschnigg, Kien Tjin-Kam-Jet, Djoerd Hiemstra

Research output: Book/Report › Report › Professional



Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun for each new search result page. In this paper we describe an algorithm that automatically determines XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from other pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that a useful XPath is determined for 85% of the tested search result pages. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
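
To illustrate what reusing an XPath across pages built on the same template can look like, here is a minimal sketch. It does not reproduce the report's algorithm: the XPath in `SRR_XPATH` is simply assumed to be the outcome of some determination step, and the two pages, the `results`/`result` markup, and the `extract_srrs` helper are hypothetical. The pages are well-formed XHTML so that Python's standard-library parser, which supports only a subset of XPath, can read them.

```python
import xml.etree.ElementTree as ET

# Two hypothetical result pages that share one template (illustrative markup only).
PAGE_ONE = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <ol id="results">
    <li class="result"><a href="http://example.org/a">Result A</a></li>
    <li class="result"><a href="http://example.org/b">Result B</a></li>
  </ol>
</body></html>
"""

PAGE_TWO = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <ol id="results">
    <li class="result"><a href="http://example.org/c">Result C</a></li>
    <li class="result"><a href="http://example.org/d">Result D</a></li>
    <li class="result"><a href="http://example.org/e">Result E</a></li>
  </ol>
</body></html>
"""

# Assumption: this expression was determined once, from a single result page.
SRR_XPATH = ".//ol[@id='results']/li[@class='result']"


def extract_srrs(page_source, xpath=SRR_XPATH):
    """Return (title, url) pairs for every node matched by the XPath."""
    root = ET.fromstring(page_source.strip())
    records = []
    for node in root.findall(xpath):
        link = node.find("a")  # each record is assumed to contain one link
        records.append((link.text, link.get("href")))
    return records


if __name__ == "__main__":
    # The same expression is applied unchanged to both pages.
    for page in (PAGE_ONE, PAGE_TWO):
        print(extract_srrs(page))
```

Real search result pages are rarely well-formed XML, so a practical setup would likely need an HTML-tolerant parser with fuller XPath support; the standard-library parser is used here only to keep the sketch self-contained.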
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 10
Publication status: Published - 8 Mar 2012

Publication series

Name: CTIT Technical Report Series
Publisher: University of Twente, Centre for Telematics and Information Technology
ISSN (Print): 1381-3625


  • EWI-21640
  • IR-79917
  • Scraper
  • Wrapper
  • Web extraction
  • Search result extraction
  • METIS-285252
