Efficient Web Harvesting Strategies for Monitoring Deep Web Content

Mohammadreza Khelghati, Djoerd Hiemstra, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Web content changes rapidly [18]. In Focused Web Harvesting [17], whose aim is to achieve a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to all the web data relevant to their topics of interest. Whether you are a fan following your favorite idol or a journalist investigating a topic, you may need to access not only all the relevant information but also recent changes and updates. General search engines like Google apply several techniques to enhance the freshness of their crawled data. However, in focused web harvesting, we lack an efficient approach that detects changes for a given topic over time. In this paper, we focus on techniques that can keep the content relevant to a given query up to date. To do so, we test four different approaches to efficiently harvest all the changed documents matching a given entity by querying web search engines. We define a changed document as a document whose content has changed, or one that has been newly created or removed. Among the proposed change detection approaches, the FedWeb method outperforms the others in finding changed content on the web for a given query, performing 20 percent better on average.
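
The definition of a changed document given in the abstract can be illustrated with a minimal sketch. This is not the paper's method (the abstract does not describe the four approaches in detail); it only shows how two harvests of the same query could be compared under that definition. The function names, the URL-to-text dictionary layout, and the use of a SHA-256 digest to compare content are all assumptions made for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 digest of the document text.

    Hashing is an assumed way of comparing content; the abstract
    does not say how content changes are detected.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Compare two harvests, each a hypothetical mapping of URL -> document text.

    Following the abstract's definition, a changed document is one that
    was newly created, was removed, or has different content.
    """
    old_urls, new_urls = set(old), set(new)
    created = new_urls - old_urls   # present only in the newer harvest
    removed = old_urls - new_urls   # present only in the older harvest
    modified = {
        url
        for url in old_urls & new_urls
        if fingerprint(old[url]) != fingerprint(new[url])
    }
    return {"created": created, "removed": removed, "modified": modified}


# Example usage with made-up URLs and contents.
earlier = {"http://example.org/a": "first version", "http://example.org/b": "stable"}
later = {"http://example.org/a": "second version", "http://example.org/c": "new page"}
report = changed_documents(earlier, later)
```

A comparison like this needs both harvests in full, so it says nothing about the efficiency question the paper addresses, namely how to find the changed documents with few queries to the search engine.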
Original language: English
Title of host publication: Proceedings of the 18th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2016)
Place of Publication: New York
Publisher: Association for Computing Machinery
Pages: 389-393
Number of pages: 5
ISBN (Print): 978-1-4503-4807-2
Publication status: Published - Nov 2016
Event: 18th International Conference on Information Integration and Web-based Applications & Services, iiWAS 2016 - Singapore
Duration: 28 Nov 2016 to 30 Nov 2016

Conference

Conference: 18th International Conference on Information Integration and Web-based Applications & Services, iiWAS 2016
Period: 28/11/16 to 30/11/16
Other: 28-30 November 2016
