Parsimonious Language Models for a Terabyte of Text

Djoerd Hiemstra, Jaap Kamps, Rianne Kaptein, Rongmei Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic



The aims of this paper are twofold. Our first aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings over various measures on topic sets judged at both the Terabyte and Million Query tracks, with runs using the full-text index giving superior results on all measures, but also some noteworthy upsets. Our second aim is to explore the use of parsimonious language models for retrieval on terabyte-scale collections. These models are smaller, and thus more efficient than standard language models when used at indexing time, and they may also improve retrieval performance. We have conducted initial experiments using parsimonious models in combination with pseudo-relevance feedback, for both the Terabyte and Million Query track topic sets, and obtained promising initial results.
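The abstract's notion of a parsimonious language model can be made concrete with a small sketch. The standard formulation (Hiemstra, Robertson and Zaragoza, 2004) estimates a document model with EM, attributing each term occurrence partly to the document-specific model and partly to a background collection model, then pruning terms whose document-specific probability falls below a threshold. The code below is an illustrative sketch of that EM procedure, not the paper's actual implementation; the function name, the parameter values, and the pruning threshold are assumptions for the example.

```python
def parsimonious_lm(doc_tf, background, lam=0.1, iters=50, threshold=1e-4):
    """Estimate a parsimonious document language model via EM (sketch).

    doc_tf:     term -> raw term frequency in the document
    background: term -> P(t|C), the collection (background) model
    lam:        weight of the document-specific model; 1 - lam on background
    """
    total = sum(doc_tf.values())
    # Initialise with the maximum-likelihood document model
    p = {t: tf / total for t, tf in doc_tf.items()}
    for _ in range(iters):
        # E-step: expected count of t attributed to the document model
        e = {}
        for t, tf in doc_tf.items():
            num = lam * p[t]
            denom = num + (1 - lam) * background.get(t, 0.0)
            e[t] = tf * num / denom if denom > 0 else 0.0
        # M-step: renormalise, pruning near-zero terms (the "parsimony" step)
        z = sum(e.values())
        p = {t: v / z for t, v in e.items() if v / z >= threshold}
        s = sum(p.values())
        p = {t: v / s for t, v in p.items()}
        # Restrict the term set to surviving terms for the next iteration
        doc_tf = {t: doc_tf[t] for t in p}
    return p
```

The effect is that frequent background words (stopwords, collection-wide common terms) lose probability mass to the background model and may be pruned entirely, which is what makes the resulting index smaller than one built from standard maximum-likelihood document models.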
Original language: English
Title of host publication: The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings
Place of publication: Washington, DC
Publisher: National Institute of Standards and Technology
Number of pages: 7
Publication status: Published - 2008
Event: 16th Text REtrieval Conference, TREC 2007 - Gaithersburg, United States
Duration: 6 Nov 2007 to 9 Nov 2007
Conference number: 16

Publication series

Name: NIST Special Publication
Publisher: US National Institute of Standards and Technology (NIST)


Conference: 16th Text REtrieval Conference, TREC 2007
Abbreviated title: TREC
Country/Territory: United States


  • IR-64757
  • METIS-250975
  • EWI-12720


