Efficient Web Harvesting Strategies for Monitoring Deep Web Content

Mohammadreza Khelghati, Djoerd Hiemstra, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

Abstract

Web content changes rapidly [18]. In Focused Web Harvesting [17], which aims to achieve a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to the full set of web data relevant to their topics of interest. Whether you are a fan following your favorite idol or a journalist investigating a topic, you may need access not only to all the relevant information but also to recent changes and updates. General search engines like Google apply several techniques to enhance the freshness of their crawled data. However, focused web harvesting lacks an efficient approach for detecting changes for a given topic over time. In this paper, we focus on techniques that keep the content relevant to a given query up to date. To do so, we test four different approaches to efficiently harvest all the changed documents matching a given entity by querying web search engines. We define a changed document as a document whose content has changed, or one that was newly created or removed. Among the proposed change detection approaches, the FedWeb method outperforms the others in finding changed content on the web for a given query, with 20 percent better performance on average.
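
The definition of a changed document given in the abstract can be illustrated with a minimal sketch. This is not the paper's method (the abstract does not describe how any of the four approaches works); it only shows one plausible way to classify documents as created, removed, or modified between two harvests of the same query. The function names, the URL-to-text dictionaries, and the use of a SHA-256 content hash are all assumptions made for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a content hash used to compare two versions of a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Compare two harvests (hypothetical URL -> text mappings).

    A document counts as changed if it is newly created, removed,
    or present in both harvests with different content.
    """
    created = set(new) - set(old)
    removed = set(old) - set(new)
    modified = {
        url
        for url in set(old) & set(new)
        if fingerprint(old[url]) != fingerprint(new[url])
    }
    return {"created": created, "removed": removed, "modified": modified}


# Illustrative data only: two snapshots of results for the same query.
previous_harvest = {"a": "first page", "b": "second page", "c": "third page"}
current_harvest = {"a": "first page", "b": "second page, edited", "d": "new page"}
changes = changed_documents(previous_harvest, current_harvest)
```

Comparing exact hashes treats any byte-level difference as a change; a real harvester would likely normalize pages first (for example, stripping timestamps or advertisements), which this sketch does not attempt.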
Original language: English
Title of host publication: Proceedings of the 18th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2016)
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 389-393
Number of pages: 5
ISBN (Print): 978-1-4503-4807-2
DOI: https://doi.org/10.1145/3011141.3011198
Publication status: Published - Nov 2016

Keywords

  • EWI-27474
  • IR-102492
  • METIS-319499

Cite this

Khelghati, M., Hiemstra, D., & van Keulen, M. (2016). Efficient Web Harvesting Strategies for Monitoring Deep Web Content. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2016) (pp. 389-393). New York: Association for Computing Machinery (ACM). https://doi.org/10.1145/3011141.3011198