Web content changes rapidly. In focused web harvesting, which aims to achieve a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to all the web data relevant to their topics of interest. Whether you are a fan following your favorite idol or a journalist investigating a topic, you may need access not only to all the relevant information but also to recent changes and updates. General search engines like Google apply several techniques to enhance the freshness of their crawled data. However, focused web harvesting lacks an efficient approach for detecting changes for a given topic over time. In this paper, we focus on techniques that keep the content relevant to a given query up to date. To do so, we test four different approaches to efficiently harvest all the changed documents matching a given entity by querying web search engines. We define a changed document as a document whose content has changed, or a document that has been newly created or removed. Among the proposed change detection approaches, the FedWeb method outperforms the others in finding changed content on the web for a given query, performing 20 percent better on average.
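The definition of a changed document given in the abstract can be illustrated with a minimal sketch. It is not the FedWeb method or any of the four approaches evaluated in the paper; it only assumes that two snapshots of a harvest are available as mappings from URL to page text, and it compares them. The names `fingerprint` and `changed_documents` are hypothetical and chosen for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable content hash for a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Classify URLs as created, removed, or modified between two snapshots.

    Assumption: `old` and `new` map each URL to the page text harvested
    at two points in time for the same query.
    """
    old_urls, new_urls = set(old), set(new)
    created = new_urls - old_urls          # present only in the newer snapshot
    removed = old_urls - new_urls          # present only in the older snapshot
    modified = {
        url
        for url in old_urls & new_urls     # present in both snapshots
        if fingerprint(old[url]) != fingerprint(new[url])
    }
    return {"created": created, "removed": removed, "modified": modified}


if __name__ == "__main__":
    # Hypothetical example data, not taken from the paper.
    before = {"a.example/1": "hello", "a.example/2": "world"}
    after = {"a.example/2": "world!", "a.example/3": "new page"}
    print(changed_documents(before, after))
```

Under this sketch, the union of the three sets corresponds to the set of changed documents as defined above. An exact hash treats any difference in the text as a change; a real harvester would likely need a more tolerant comparison to ignore noise such as timestamps or advertisements.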
|Title of host publication||Proceedings of the 18th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2016)|
|Place of Publication||New York|
|Publisher||Association for Computing Machinery (ACM)|
|Number of pages||5|
|Publication status||Published - Nov 2016|
Khelghati, M., Hiemstra, D., & van Keulen, M. (2016). Efficient Web Harvesting Strategies for Monitoring Deep Web Content. In Proceedings of the 18th International Conference on Information Integration and Web-based Applications & Services (iiWAS 2016) (pp. 389-393). New York: Association for Computing Machinery (ACM). https://doi.org/10.1145/3011141.3011198