Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF)

Mohammadreza Khelghati, Maurice van Keulen, Djoerd Hiemstra

Research output: Book/Report › Report › Professional

Abstract

The growing demand for information draws attention to the large amount of data hidden behind web forms, known as the deep web. Harvesters play a crucial role in making this data accessible. Because harvesting targets different domains and websites, there is a need for a general-purpose harvester that can be applied in different settings and situations. Developing such a harvester requires considering a number of issues, of which the features of the business domain, the features of the targeted websites, and the harvesting goals are the most influential. To consider all these elements in one big picture, this paper introduces a new concept called the harvestability factor (HF). The HF is defined as an attribute of a website (HF_w) or a harvester (HF_h) that represents the extent to which the website can be harvested or the harvester can harvest. These factors are composed of the features of websites (for HF_w) or of harvesters (for HF_h). The paper presents these features by gathering a number of them from the literature and introducing new ones based on the authors' experiments. In addition to enabling designers of websites or harvesters to evaluate where their products stand from a harvesting perspective, the HF can act as a framework for designing general-purpose deep web harvesters. This framework fills a gap in the design of general-purpose harvesters by focusing on the detailed features of deep websites that affect harvesting processes. The features presented in this paper provide a thorough list of requirements for designing deep web harvesters which, to the best of our knowledge, has not been compiled to this extent in the literature. To validate the effectiveness of the HF in practice, the paper shows how the HF's elements can be applied to categorize deep websites and how this categorization is useful in designing a harvester. The harvester developed by the authors to run the experiments is also discussed.
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 21
Publication status: Published - 16 Jun 2014

Publication series

Name: CTIT Technical Report Series
Publisher: University of Twente, Centre for Telematics and Information Technology (CTIT)
No.: TR-CTIT-14-08
ISSN (Print): 1381-3625

Keywords

  • Harvester Design Framework
  • Harvestability Factor
  • Deep Web Harvester
  • Deep Web
  • IR-91282
  • METIS-304120
  • EWI-24806
  • DB-IR: INFORMATION RETRIEVAL

Cite this

Khelghati, M., van Keulen, M., & Hiemstra, D. (2014). Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF). (CTIT Technical Report Series; No. TR-CTIT-14-08). Enschede: Centre for Telematics and Information Technology (CTIT).