How much data resides in a web collection: how to estimate size of a web collection

Mohammadreza Khelghati, Djoerd Hiemstra, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


Abstract

With the increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, knowledge of a data source's size enables accurate decisions about when to stop crawling or sampling processes, which can be very costly in some cases [4]. Interest in knowing the sizes of data sources is heightened by competition among businesses on the Web, where data coverage is critical. This information is also helpful in the quality assessment of search engines [2], in search engine selection for federated search, and in resource/collection selection in distributed search [6]. In addition, it can give insight into useful statistics for public sectors such as governments. In any of these scenarios, when facing a non-cooperative collection that does not publish its information, the size has to be estimated [5]. In this paper, the approaches in the literature are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on modifications of the available techniques are introduced and evaluated. One of these modifications improves the estimations of other approaches by 35 to 65 percent.
Original language: Undefined
Title of host publication: Proceedings of the 13th Dutch-Belgian Workshop on Information Retrieval, DIR 2013
Place of Publication: Aachen, Germany
Publisher: CEUR
Pages: 42-43
Number of pages: 2
Publication status: Published - 26 Apr 2013
Event: 13th Dutch-Belgian Information Retrieval Workshop, DIR 2013 - Delft, Netherlands
Duration: 26 Apr 2013 → 26 Apr 2013
Conference number: 13

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR
Volume: 986
ISSN (Print): 1613-0073

Workshop

Workshop: 13th Dutch-Belgian Information Retrieval Workshop, DIR 2013
Abbreviated title: DIR
Country/Territory: Netherlands
City: Delft
Period: 26/04/13 → 26/04/13

Keywords

  • EWI-23310
  • METIS-297623
  • IR-86468
  • CR-H.3.3
