Harvesting All Matching Information To A Given Query From a Deep Website

Mohammadreza Khelghati, Djoerd Hiemstra, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

In this paper, the goal is to harvest all documents matching a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, "Denzel Washington", "Iran Nuclear Deal", or "FC Barcelona" from data hidden behind web forms. Policies of web search engines usually do not allow access to all of the search results matching a given query: they limit both the number of returned documents and the number of user requests. In this work, we propose a new approach that automatically collects information related to a given query from a search engine, given the search engine's limitations. The approach minimizes the number of queries that need to be sent by applying information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measured by the total number of unique documents found per query.
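
The setting described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not say how refinement terms are selected, so the sketch simply assumes they are ranked by document frequency in an external corpus. The names `make_capped_engine`, `rank_terms`, and `harvest` are hypothetical, and the search engine is simulated by an in-memory index that caps the number of results per query.

```python
from collections import Counter


def make_capped_engine(index, max_results):
    """Simulate a search engine that returns at most max_results hits.

    index maps a document id to its text (an assumption for illustration).
    """
    def search(terms):
        hits = [doc_id for doc_id, text in index.items()
                if all(t in text.lower().split() for t in terms)]
        return hits[:max_results]
    return search


def rank_terms(external_corpus, entity):
    """Rank candidate refinement terms by document frequency in an
    external corpus (assumed heuristic; ties broken alphabetically)."""
    counts = Counter()
    for text in external_corpus:
        counts.update(set(text.lower().split()) - {entity})
    return sorted(counts, key=lambda t: (-counts[t], t))


def harvest(search, entity, candidate_terms, query_budget):
    """Collect unique documents for entity without exceeding query_budget.

    The first query is the entity alone; each later query adds one
    refinement term to surface documents hidden by the result cap.
    """
    found = set(search([entity]))
    queries = 1
    for term in candidate_terms:
        if queries >= query_budget:
            break
        found.update(search([entity, term]))
        queries += 1
    return found, queries
```

In this sketch the result cap and the query budget stand in for the two search engine limitations named in the abstract, and the number of unique documents in `found` corresponds to the evaluation measure it mentions.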
Original language: Undefined
Title of host publication: Proceedings of the 1st International Workshop on Knowledge Discovery on the Web, KDWEB 2015
Editors: Giuliano Armano, Alessandro Bozzon, Alessandro Giuliani
Place of Publication: Aachen
Publisher: CEUR
Pages: 1-7
Number of pages: 7
Publication status: Published - Sep 2015
Event: 1st International Workshop on Knowledge Discovery on the Web, KDWEB 2015, Cagliari, Italy: Proceedings of the 1st International Workshop on Knowledge Discovery on the Web, KDWEB 2015 - Aachen
Duration: 1 Sep 2015 → …

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR-WS.org
Volume: 1489
ISSN (Print): 1613-0073

Conference

Conference: 1st International Workshop on Knowledge Discovery on the Web, KDWEB 2015, Cagliari, Italy
City: Aachen
Period: 1/09/15 → …

Keywords

  • CR-H.3.3
  • METIS-314946
  • IR-98044
  • EWI-26235
