Abstract
As the amount of data in deep web sources (hidden from general
search engines behind web forms) grows, accessing this data has
gained more attention. In the algorithms applied for this purpose,
knowledge of a data source's size enables accurate decisions about
when to stop the crawling or sampling processes, which can be very
costly in some cases [14]. Interest in knowing the sizes of data
sources is heightened by competition among businesses on the Web,
for which data coverage is critical. This information is also
helpful for quality assessment of search engines [7], for search
engine selection in federated search, and for resource/collection
selection in distributed search [19]. In addition, it can give
insight into useful statistics for public sectors such as
governments. In any of these scenarios, when facing a
non-cooperative collection that does not publish its information,
the size has to be estimated [17]. In this paper, the approaches
suggested in the literature for this purpose are categorized and
reviewed. The most recent approaches are implemented and compared
in a real environment. Finally, four methods based on modifications
of the available techniques are introduced and evaluated. With one
of the modifications, the estimations from other approaches could
be improved by 35 to 65 percent.
Original language | Undefined |
---|---|
Title of host publication | Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services (iiWAS2012) |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 239-246 |
Number of pages | 8 |
ISBN (Print) | 978-1-4503-1306-3 |
Publication status | Published - 2012 |
Event | 14th International Conference on Information Integration and Web-based Applications & Services (iiWAS2012), Bali, Indonesia. Duration: 3 Dec 2012 → 5 Dec 2012 |
Conference
Conference | 14th International Conference on Information Integration and Web-based Applications & Services (iiWAS2012) |
---|---|
Period | 3/12/12 → 5/12/12 |
Other | 3-5 December 2012 |
Keywords
- Pool-Based Size Estimation
- Estimation Bias
- Regression Equations
- Size Estimation
- METIS-289755
- CR-H.3.3
- query-based sampling
- EWI-22426
- Deep Web
- Stochastic Simulation