The Influence of Basic Tokenization on Biomedical Document Retrieval

Rudolf Berend Trieschnigg, Wessel Kraaij, Franciska M.G. de Jong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

17 Citations (Scopus)
23 Downloads (Pure)

Abstract

Tokenization is a fundamental preprocessing step in Information Retrieval systems in which text is turned into index terms. This paper quantifies and compares the influence of various simple tokenization techniques on document retrieval effectiveness in two domains: biomedicine and news. As expected, biomedical retrieval is more sensitive to small changes in the tokenization method. The tokenization strategy can make the difference between a mediocre and a well-performing IR system, especially in the biomedical domain.
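
As a minimal sketch of the kind of "simple tokenization techniques" the abstract refers to (the variants below are illustrative assumptions, not the exact strategies evaluated in the paper), the Python snippet shows how small choices such as splitting on hyphens or at letter/digit boundaries change the index terms produced for a biomedical phrase, and hence which queries can match it:

```python
import re

# Three simple tokenizers; each produces a different set of index terms.
# Illustrative variants only, not the paper's exact tokenization strategies.

def whitespace_tokenize(text):
    """Split on whitespace only; hyphenated compounds stay intact."""
    return text.lower().split()

def split_on_punctuation(text):
    """Break on any non-alphanumeric character, so hyphens split terms."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def split_letter_digit(text):
    """Additionally split at letter/digit boundaries (e.g. 'brca1' -> 'brca', '1')."""
    tokens = []
    for t in split_on_punctuation(text):
        tokens.extend(re.findall(r"[a-z]+|[0-9]+", t))
    return tokens

phrase = "BRCA1-mediated IL-2 expression"  # hypothetical example phrase
for tokenize in (whitespace_tokenize, split_on_punctuation, split_letter_digit):
    print(tokenize.__name__, "->", tokenize(phrase))

# whitespace_tokenize  -> ['brca1-mediated', 'il-2', 'expression']
# split_on_punctuation -> ['brca1', 'mediated', 'il', '2', 'expression']
# split_letter_digit   -> ['brca', '1', 'mediated', 'il', '2', 'expression']
```

A query for "IL-2" only matches documents indexed with a compatible tokenizer, which is one way such small differences can translate into the retrieval effectiveness gaps the paper measures.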
Original language: Undefined
Title of host publication: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Place of Publication: New York, NY, USA
Publisher: ACM Press
Pages: 803-804
Number of pages: 2
ISBN (Print): 978-1-59593-597-7
DOIs
Publication status: Published - 2007
Event: 30th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007 - Amsterdam, Netherlands
Duration: 23 Jul 2007 - 27 Jul 2007
Conference number: 30

Publication series

Name
Publisher: ACM Press
Number: LNCS4549

Conference

Conference: 30th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007
Abbreviated title: SIGIR
Country/Territory: Netherlands
City: Amsterdam
Period: 23/07/07 - 27/07/07

Keywords

  • HMI-IE: Information Engineering
  • HMI-MR: MULTIMEDIA RETRIEVAL
  • METIS-241899
  • IR-61906
  • EWI-11033
  • HMI-SLT: Speech and Language Technology
