Abstract
To harvest all information about a given entity, this paper addresses the retrieval of all documents matching a given query submitted to a search engine. The objective is to retrieve all information about, for instance, "Michael Jackson", "Islamic State", or "FC Barcelona" from data indexed by search engines, or from hidden data behind web forms, using a minimum number of queries. Web search engine policies usually do not allow access to all search results matching a given query: they limit both the number of returned documents and the number of user requests. These limitations also apply to deep web sources, for instance social networks like Twitter. In this work, we propose a new approach that automatically collects information related to a given query from a search engine, subject to the search engine's limitations. The approach minimizes the number of queries that need to be sent by analysing the retrieved results and combining this analysis with information from a large external corpus. When tested on Google, the new approach outperforms existing approaches in terms of the total number of unique documents found per query.
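The general idea of harvesting under a result limit can be illustrated with a small sketch. It is not the paper's algorithm: the toy engine `make_engine`, the greedy term-selection rule in `harvest`, and the `DOCS` and `CORPUS_FREQ` data are all hypothetical, chosen only to show how terms taken from retrieved results, weighted by frequencies from an external corpus, could drive follow-up queries.

```python
from collections import Counter

def make_engine(docs, limit):
    """Toy search engine (assumption): returns at most `limit` matching documents."""
    def search(terms):
        hits = [(i, docs[i]) for i in sorted(docs)
                if all(t in docs[i].split() for t in terms)]
        return hits[:limit]  # the engine truncates the result list
    return search

def harvest(search, seed, corpus_freq, max_queries):
    """Greedy sketch: refine the seed query with terms seen in retrieved results."""
    found = dict(search([seed]))  # doc id -> text
    tried, queries = set(), 1
    while queries < max_queries:
        # Count, per term, how many harvested documents contain it.
        seen = Counter(w for text in found.values()
                       for w in set(text.split()) if w != seed)
        candidates = [w for w in seen if w not in tried]
        if not candidates:
            break
        # Illustrative heuristic: prefer terms common in the external corpus
        # but rare among harvested documents (ties broken alphabetically).
        term = max(candidates,
                   key=lambda w: (corpus_freq.get(w, 0) / seen[w], w))
        tried.add(term)
        found.update(search([seed, term]))
        queries += 1
    return set(found), queries

# Hypothetical data for demonstration only.
DOCS = {
    1: "jackson thriller album",
    2: "jackson moonwalk dance",
    3: "jackson thriller video",
    4: "jackson tour concert",
    5: "jackson moonwalk video",
    6: "other topic",
}
CORPUS_FREQ = {"thriller": 50, "moonwalk": 40, "video": 60,
               "album": 30, "dance": 20, "tour": 10, "concert": 10}

ids, used = harvest(make_engine(DOCS, limit=2), "jackson", CORPUS_FREQ, max_queries=3)
print(sorted(ids), used)
```

With a limit of two results per query, the seed query alone reaches two documents, while two refined queries reach four. Document 4 remains unreachable in this sketch because none of its terms appear in any retrieved result, which hints at why an external corpus is useful as an additional source of query terms.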
Original language | English |
---|---|
Title of host publication | Proceedings of the 17th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2015) |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 65 |
Number of pages | 9 |
ISBN (Print) | 978-1-4503-3491-4 |
Publication status | Published - 11 Dec 2015 |
Event | 17th International Conference on Information Integration and Web-based Applications & Services, IIWAS 2015, Brussels, Belgium. Duration: 11 Dec 2015 → 13 Dec 2015. Conference number: 17 |
Conference
Conference | 17th International Conference on Information Integration and Web-based Applications & Services, IIWAS 2015 |
---|---|
Abbreviated title | IIWAS |
Country/Territory | Belgium |
City | Brussels |
Period | 11/12/15 → 13/12/15 |
Keywords
- CR-H.3.3
- 22/3 OA procedure