Query-Based Sampling: Can we do Better than Random?

A.S. Tigelaar, Djoerd Hiemstra

Research output: Book/Report, Report, Professional

Abstract

Many servers on the web offer content that is accessible only through a search interface. These servers are part of the deep web. Indexing their content with conventional crawling is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling that requires no cooperation beyond a basic search interface. In this approach, random queries are conventionally sent to a server to obtain a sample of documents from the underlying collection. This sample serves as a representation of the entire server content, called a resource description. In this research we explore whether better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.
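
As an illustration of the sampling loop described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a toy in-memory collection in place of a remote server, a hypothetical `search` function standing in for the search interface, and raw term counts in the sample as the notion of term frequency. The names `pick_random`, `pick_least_frequent`, and `query_based_sample` are made up for this example.

```python
import random
from collections import Counter


def tokenize(text):
    # Lowercase whitespace tokenizer that keeps alphabetic tokens only.
    return [t for t in text.lower().split() if t.isalpha()]


def search(collection, term, top_n=4):
    # Hypothetical stand-in for a remote search interface:
    # returns the ids of up to top_n documents that contain the term.
    hits = [i for i, text in enumerate(collection) if term in tokenize(text)]
    return hits[:top_n]


def pick_random(term_counts, used, rng):
    # Conventional strategy: a random unused term from the sample vocabulary.
    candidates = sorted(t for t in term_counts if t not in used)
    return rng.choice(candidates) if candidates else None


def pick_least_frequent(term_counts, used, rng):
    # Alternative strategy: the unused term with the lowest count in the
    # sample (ties broken alphabetically; rng is unused, kept for symmetry).
    candidates = sorted((c, t) for t, c in term_counts.items() if t not in used)
    return candidates[0][1] if candidates else None


def query_based_sample(collection, seed_term, pick_term, max_queries=20, seed=0):
    rng = random.Random(seed)
    sampled = set()          # ids of documents retrieved so far
    term_counts = Counter()  # term statistics: the resource description
    used = set()             # query terms already sent
    term = seed_term
    for _ in range(max_queries):
        if term is None:     # vocabulary exhausted
            break
        used.add(term)
        for doc_id in search(collection, term):
            if doc_id not in sampled:
                sampled.add(doc_id)
                term_counts.update(tokenize(collection[doc_id]))
        term = pick_term(term_counts, used, rng)
    return sampled, term_counts


docs = [
    "deep web search servers",
    "search interface for servers",
    "crawling the deep web",
    "resource description of a collection",
]
sampled, description = query_based_sample(docs, "search", pick_least_frequent)
```

In this toy collection the last document shares no vocabulary with the others, so neither strategy can reach it from the seed term "search". This shows that a sample covers only the documents reachable through the vocabulary seen so far.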
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 8
Publication status: Published - 1 Feb 2010

Publication series

Name: CTIT Technical Report Series
No.: TR-CTIT-10-04
ISSN (Print): 1381-3625

Keywords

  • METIS-270727
  • CR-H.3.3
  • CR-H.3.4
  • query-based sampling
  • EWI-17404
  • Distributed Information Retrieval

Cite this

Tigelaar, A. S., & Hiemstra, D. (2010). Query-Based Sampling: Can we do Better than Random? (CTIT Technical Report Series; No. TR-CTIT-10-04). Enschede: Centre for Telematics and Information Technology (CTIT).
Tigelaar, A.S. ; Hiemstra, Djoerd. / Query-Based Sampling: Can we do Better than Random? Enschede : Centre for Telematics and Information Technology (CTIT), 2010. 8 p. (CTIT Technical Report Series; TR-CTIT-10-04).
@book{3f71b8659773432db46122946a28f517,
title = "Query-Based Sampling: Can we do Better than Random?",
keywords = "METIS-270727, CR-H.3.3, CR-H.3.4, query-based sampling, EWI-17404, Distributed Information Retrieval",
author = "A.S. Tigelaar and Djoerd Hiemstra",
year = "2010",
month = "2",
day = "1",
language = "Undefined",
series = "CTIT Technical Report Series",
publisher = "Centre for Telematics and Information Technology (CTIT)",
number = "TR-CTIT-10-04",
address = "Netherlands",

}

Tigelaar, AS & Hiemstra, D 2010, Query-Based Sampling: Can we do Better than Random? CTIT Technical Report Series, no. TR-CTIT-10-04, Centre for Telematics and Information Technology (CTIT), Enschede.

Query-Based Sampling: Can we do Better than Random? / Tigelaar, A.S.; Hiemstra, Djoerd.

Enschede : Centre for Telematics and Information Technology (CTIT), 2010. 8 p. (CTIT Technical Report Series; No. TR-CTIT-10-04).

TY - BOOK

T1 - Query-Based Sampling: Can we do Better than Random?

AU - Tigelaar, A.S.

AU - Hiemstra, Djoerd

PY - 2010/2/1

Y1 - 2010/2/1

KW - METIS-270727

KW - CR-H.3.3

KW - CR-H.3.4

KW - query-based sampling

KW - EWI-17404

KW - Distributed Information Retrieval

M3 - Report

T3 - CTIT Technical Report Series

BT - Query-Based Sampling: Can we do Better than Random?

PB - Centre for Telematics and Information Technology (CTIT)

CY - Enschede

ER -

Tigelaar AS, Hiemstra D. Query-Based Sampling: Can we do Better than Random? Enschede: Centre for Telematics and Information Technology (CTIT), 2010. 8 p. (CTIT Technical Report Series; TR-CTIT-10-04).