The change of the web content is rapid. In Focused Web Harvesting [?], which aims at achieving a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need to access a complete set of related web data to their interesting topics. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, in Focused Web Harvesting, we lack an efficient approach that detects changes of the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperform the baseline and other approaches in finding the changed content on the web for a given entity with at least an average of 20 percent better performance.
|Place of Publication||Enschede|
|Publisher||Centre for Telematics and Information Technology (CTIT)|
|Number of pages||9|
|Publication status||Published - 15 May 2016|
|Name||CTIT Technical Report Series|
|Publisher||University of Twente, Centre for Telematics and Information Technology (CTIT)|
Khelghati, M., Hiemstra, D., & van Keulen, M. (2016). Efficient Web Harvesting Strategies for Monitoring Deep Web Content. (CTIT Technical Report Series; No. TR-CTIT-16-05). Enschede: Centre for Telematics and Information Technology (CTIT).