Efficient Web Harvesting Strategies for Monitoring Deep Web Content

Mohammadreza Khelghati, Djoerd Hiemstra, Maurice van Keulen

Research output: Book/Report › Report › Academic

2 Citations (Scopus)
224 Downloads (Pure)

Abstract

Web content changes rapidly. In Focused Web Harvesting [?], which aims at a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to a complete set of web data related to their topics of interest. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, Focused Web Harvesting lacks an efficient approach for detecting changes in the content for a given topic over time. In this paper, we focus on techniques that keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding changed content on the web for a given entity, performing at least 20 percent better on average.
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 9
Publication status: Published - 15 May 2016

Publication series

Name: CTIT Technical Report Series
Publisher: University of Twente, Centre for Telematics and Information Technology (CTIT)
No.: TR-CTIT-16-05
ISSN (Print): 1381-3625

Keywords

  • EWI-27136
  • METIS-318486
  • IR-101069
