Efficient Web Harvesting Strategies for Monitoring Deep Web Content

Mohammadreza Khelghati, Djoerd Hiemstra, Maurice van Keulen

Research output: Book/Report › Report › Academic

Abstract

Web content changes rapidly. In Focused Web Harvesting [?], which aims at achieving a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to a complete set of web data related to their topics of interest. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, Focused Web Harvesting lacks an efficient approach for detecting changes in the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding changed content on the web for a given entity, by at least 20 percent on average.
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 9
Publication status: Published - 15 May 2016

Publication series

Name: CTIT Technical Report Series
Publisher: University of Twente, Centre for Telematics and Information Technology (CTIT)
No.: TR-CTIT-16-05
ISSN (Print): 1381-3625

Keywords

  • EWI-27136
  • METIS-318486
  • IR-101069

Cite this

Khelghati, M., Hiemstra, D., & van Keulen, M. (2016). Efficient Web Harvesting Strategies for Monitoring Deep Web Content. (CTIT Technical Report Series; No. TR-CTIT-16-05). Enschede: Centre for Telematics and Information Technology (CTIT).
Khelghati, Mohammadreza ; Hiemstra, Djoerd ; van Keulen, Maurice. / Efficient Web Harvesting Strategies for Monitoring Deep Web Content. Enschede : Centre for Telematics and Information Technology (CTIT), 2016. 9 p. (CTIT Technical Report Series; TR-CTIT-16-05).
@book{f87404bf31744ca59d7277927e9db44c,
title = "Efficient Web Harvesting Strategies for Monitoring Deep Web Content",
abstract = "Web content changes rapidly. In Focused Web Harvesting [?], which aims at achieving a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to a complete set of web data related to their topics of interest. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, Focused Web Harvesting lacks an efficient approach for detecting changes in the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding changed content on the web for a given entity, by at least 20 percent on average.",
keywords = "EWI-27136, METIS-318486, IR-101069",
author = "Mohammadreza Khelghati and Djoerd Hiemstra and {van Keulen}, Maurice",
note = "eemcs-eprint-27136",
year = "2016",
month = "5",
day = "15",
language = "Undefined",
series = "CTIT Technical Report Series",
publisher = "Centre for Telematics and Information Technology (CTIT)",
number = "TR-CTIT-16-05",
address = "Netherlands",
}

Khelghati, M, Hiemstra, D & van Keulen, M 2016, Efficient Web Harvesting Strategies for Monitoring Deep Web Content. CTIT Technical Report Series, no. TR-CTIT-16-05, Centre for Telematics and Information Technology (CTIT), Enschede.

Efficient Web Harvesting Strategies for Monitoring Deep Web Content. / Khelghati, Mohammadreza; Hiemstra, Djoerd; van Keulen, Maurice.

Enschede : Centre for Telematics and Information Technology (CTIT), 2016. 9 p. (CTIT Technical Report Series; No. TR-CTIT-16-05).

TY - BOOK

T1 - Efficient Web Harvesting Strategies for Monitoring Deep Web Content

AU - Khelghati, Mohammadreza

AU - Hiemstra, Djoerd

AU - van Keulen, Maurice

N1 - eemcs-eprint-27136

PY - 2016/5/15

Y1 - 2016/5/15

AB - Web content changes rapidly. In Focused Web Harvesting [?], which aims at achieving a complete harvest for a given topic, this dynamic nature of the web creates problems for users who need access to a complete set of web data related to their topics of interest. Whether you are a fan following your favourite artist, athlete or politician, or a journalist investigating a topic, you need to access all the information relevant to your topics of interest and keep it up-to-date over time. General search engines like Google apply different techniques to enhance the freshness of their crawled data. However, Focused Web Harvesting lacks an efficient approach for detecting changes in the content for a given topic over time. In this paper, we focus on techniques that allow us to keep the content relevant to a given entity up-to-date. To do so, we introduce approaches to efficiently harvest all the new and changed documents matching a given entity by querying a web search engine. One of our proposed approaches outperforms the baseline and the other approaches in finding changed content on the web for a given entity, by at least 20 percent on average.

KW - EWI-27136

KW - METIS-318486

KW - IR-101069

M3 - Report

T3 - CTIT Technical Report Series

BT - Efficient Web Harvesting Strategies for Monitoring Deep Web Content

PB - Centre for Telematics and Information Technology (CTIT)

CY - Enschede

ER -

Khelghati M, Hiemstra D, van Keulen M. Efficient Web Harvesting Strategies for Monitoring Deep Web Content. Enschede: Centre for Telematics and Information Technology (CTIT), 2016. 9 p. (CTIT Technical Report Series; TR-CTIT-16-05).