Ranking XPaths for extracting search result records

Research output: Book/Report > Report > Professional

Abstract

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.
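
To make the reuse idea in the abstract concrete, the sketch below applies one XPath expression to two pages that share a template. It is an illustration only and not the algorithm from the report: the XPath is written by hand rather than determined automatically, and the page snippets, the `SRR_XPATH` constant and the `extract_srrs` helper are all hypothetical. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, so the sample pages are well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Two hypothetical result pages built from the same template
# (well-formed XHTML so that ElementTree can parse them).
PAGE_ONE = """
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <ul class="results">
    <li class="result"><a href="http://example.org/a">Result A</a></li>
    <li class="result"><a href="http://example.org/b">Result B</a></li>
  </ul>
</body></html>
""".strip()

PAGE_TWO = """
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <ul class="results">
    <li class="result"><a href="http://example.org/x">Result X</a></li>
    <li class="result"><a href="http://example.org/y">Result Y</a></li>
    <li class="result"><a href="http://example.org/z">Result Z</a></li>
  </ul>
</body></html>
""".strip()

# Hand-written XPath (an assumption for this example): each search result
# record is an <li class="result"> inside <ul class="results">.
SRR_XPATH = ".//ul[@class='results']/li[@class='result']"


def extract_srrs(page, xpath):
    """Return (title, url) pairs for every node matched by the XPath."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(xpath):
        link = node.find("a")  # first <a> child of the record node
        records.append((link.text, link.get("href")))
    return records


# The same expression works on both pages because they share a template.
for page in (PAGE_ONE, PAGE_TWO):
    print(extract_srrs(page, SRR_XPATH))
```

The navigation link is not matched because it sits outside the `results` list, which is the property a useful SRR XPath needs: it selects the records and nothing else on any page generated from the template.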
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 10
Publication status: Published - 8 Mar 2012

Publication series

Name: CTIT Technical Report Series
Publisher: University of Twente, Centre for Telematics and Information Technology
No.: TR-CTIT-12-08
ISSN (Print): 1381-3625

Keywords

  • DB-DM: DATA MINING
  • EWI-21640
  • IR-79917
  • DB-IR: INFORMATION RETRIEVAL
  • Scraper
  • Wrapper
  • Web extraction
  • Search result extraction
  • METIS-285252

Cite this

Trieschnigg, R. B., Tjin-Kam-Jet, K., & Hiemstra, D. (2012). Ranking XPaths for extracting search result records. (CTIT Technical Report Series; No. TR-CTIT-12-08). Enschede: Centre for Telematics and Information Technology (CTIT).
Trieschnigg, Rudolf Berend ; Tjin-Kam-Jet, Kien ; Hiemstra, Djoerd. / Ranking XPaths for extracting search result records. Enschede : Centre for Telematics and Information Technology (CTIT), 2012. 10 p. (CTIT Technical Report Series; TR-CTIT-12-08).
@book{71843fb52f1149509d20d946702326cc,
title = "Ranking XPaths for extracting search result records",
abstract = "Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85{\%} of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.",
keywords = "DB-DM: DATA MINING, EWI-21640, IR-79917, DB-IR: INFORMATION RETRIEVAL, Scraper, Wrapper, Web extraction, Search result extraction, METIS-285252",
author = "Trieschnigg, {Rudolf Berend} and Kien Tjin-Kam-Jet and Djoerd Hiemstra",
year = "2012",
month = "3",
day = "8",
language = "Undefined",
series = "CTIT Technical Report Series",
publisher = "Centre for Telematics and Information Technology (CTIT)",
number = "TR-CTIT-12-08",
address = "Netherlands",

}

Trieschnigg, RB, Tjin-Kam-Jet, K & Hiemstra, D 2012, Ranking XPaths for extracting search result records. CTIT Technical Report Series, no. TR-CTIT-12-08, Centre for Telematics and Information Technology (CTIT), Enschede.

Ranking XPaths for extracting search result records. / Trieschnigg, Rudolf Berend; Tjin-Kam-Jet, Kien; Hiemstra, Djoerd.

Enschede : Centre for Telematics and Information Technology (CTIT), 2012. 10 p. (CTIT Technical Report Series; No. TR-CTIT-12-08).

TY - BOOK

T1 - Ranking XPaths for extracting search result records

AU - Trieschnigg, Rudolf Berend

AU - Tjin-Kam-Jet, Kien

AU - Hiemstra, Djoerd

PY - 2012/3/8

Y1 - 2012/3/8

N2 - Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.

AB - Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.

KW - DB-DM: DATA MINING

KW - EWI-21640

KW - IR-79917

KW - DB-IR: INFORMATION RETRIEVAL

KW - Scraper

KW - Wrapper

KW - Web extraction

KW - Search result extraction

KW - METIS-285252

M3 - Report

T3 - CTIT Technical Report Series

BT - Ranking XPaths for extracting search result records

PB - Centre for Telematics and Information Technology (CTIT)

CY - Enschede

ER -

Trieschnigg RB, Tjin-Kam-Jet K, Hiemstra D. Ranking XPaths for extracting search result records. Enschede: Centre for Telematics and Information Technology (CTIT), 2012. 10 p. (CTIT Technical Report Series; TR-CTIT-12-08).