Resource Selection for Federated Search on the Web

Dong-Phuong Nguyen, Thomas Demeester, Rudolf Berend Trieschnigg, Djoerd Hiemstra

Research output: Book/Report > Report > Professional

Abstract

A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines. First, we experiment with estimating the size of uncooperative search engines on the web using query-based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.
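
To illustrate why size estimates and skewed collection sizes matter for resource selection, the following is a minimal sketch of a simplified ReDDE-style scorer, one well-known size-weighted approach. It is not taken from the report, and the abstract does not name the methods it compares; the function name `redde_style_scores`, its parameters, and the toy data are all hypothetical.

```python
from collections import Counter


def redde_style_scores(ranked_sample_docs, doc_to_resource,
                       est_size, sample_size, top_n=100):
    """Rank resources from a centralized index of sampled documents.

    ranked_sample_docs: sampled document ids ranked for a query, best first
    doc_to_resource:    maps each sampled document id to its resource
    est_size:           estimated number of documents in each resource
    sample_size:        number of documents sampled from each resource
    (All names are illustrative assumptions, not the report's notation.)
    """
    scores = Counter()
    for doc_id in ranked_sample_docs[:top_n]:
        resource = doc_to_resource[doc_id]
        # Each sampled document stands in for est_size / sample_size
        # documents of its resource, so large resources gain more per hit.
        scores[resource] += est_size[resource] / sample_size[resource]
    return scores.most_common()


# Toy data (invented): resource A is ten times larger than resource B.
ranked = ["a1", "b1", "a2", "b2"]
doc_to_resource = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
est_size = {"A": 1000, "B": 100}
sample_size = {"A": 10, "B": 10}

ranking = redde_style_scores(ranked, doc_to_resource,
                             est_size, sample_size, top_n=3)
print(ranking)
```

In this toy run the larger resource dominates the ranking even though both resources have sampled documents near the top, which hints at how strongly skewed sizes can drive size-weighted selection.
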
Original language: Undefined
Place of Publication: Enschede, The Netherlands
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 17
Publication status: Published - Sep 2016

Publication series

Name: CTIT Technical Report Series
Publisher: University of Twente, Centre for Telematics and Information Technology (CTIT)
No.: TR-CTIT-16-12
ISSN (Print): 1381-3625

Keywords

  • IR-101237
  • METIS-318510
  • EWI-27190
  • CR-H.3.3

Cite this

Nguyen, D-P., Demeester, T., Trieschnigg, R. B., & Hiemstra, D. (2016). Resource Selection for Federated Search on the Web. (CTIT Technical Report Series; No. TR-CTIT-16-12). Enschede, The Netherlands: Centre for Telematics and Information Technology (CTIT).
Nguyen, Dong-Phuong ; Demeester, Thomas ; Trieschnigg, Rudolf Berend ; Hiemstra, Djoerd. / Resource Selection for Federated Search on the Web. Enschede, The Netherlands : Centre for Telematics and Information Technology (CTIT), 2016. 17 p. (CTIT Technical Report Series; TR-CTIT-16-12).
@book{f07d3242a87c4bf6b1b1e0a6d18bfa11,
title = "Resource Selection for Federated Search on the Web",
abstract = "A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines. First, we experiment with estimating the size of uncooperative search engines on the web using query-based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.",
keywords = "IR-101237, METIS-318510, EWI-27190, CR-H.3.3",
author = "Dong-Phuong Nguyen and Thomas Demeester and Trieschnigg, {Rudolf Berend} and Djoerd Hiemstra",
note = "eemcs-eprint-27190",
year = "2016",
month = "9",
language = "Undefined",
series = "CTIT Technical Report Series",
publisher = "Centre for Telematics and Information Technology (CTIT)",
number = "TR-CTIT-16-12",
address = "Netherlands",

}

Nguyen, D-P, Demeester, T, Trieschnigg, RB & Hiemstra, D 2016, Resource Selection for Federated Search on the Web. CTIT Technical Report Series, no. TR-CTIT-16-12, Centre for Telematics and Information Technology (CTIT), Enschede, The Netherlands.

TY - BOOK

T1 - Resource Selection for Federated Search on the Web

AU - Nguyen, Dong-Phuong

AU - Demeester, Thomas

AU - Trieschnigg, Rudolf Berend

AU - Hiemstra, Djoerd

N1 - eemcs-eprint-27190

PY - 2016/9

Y1 - 2016/9

N2 - A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines. First, we experiment with estimating the size of uncooperative search engines on the web using query-based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.

AB - A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines. First, we experiment with estimating the size of uncooperative search engines on the web using query-based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.

KW - IR-101237

KW - METIS-318510

KW - EWI-27190

KW - CR-H.3.3

M3 - Report

T3 - CTIT Technical Report Series

BT - Resource Selection for Federated Search on the Web

PB - Centre for Telematics and Information Technology (CTIT)

CY - Enschede, The Netherlands

ER -

Nguyen D-P, Demeester T, Trieschnigg RB, Hiemstra D. Resource Selection for Federated Search on the Web. Enschede, The Netherlands: Centre for Telematics and Information Technology (CTIT), 2016. 17 p. (CTIT Technical Report Series; TR-CTIT-16-12).