Abstract
Tokenization is a fundamental preprocessing step in information retrieval (IR) systems, in which text is turned into index terms. This paper quantifies and compares the influence of various simple tokenization techniques on document retrieval effectiveness in two domains: biomedicine and news. As expected, biomedical retrieval is more sensitive to small changes in the tokenization method. The tokenization strategy can make the difference between a mediocre and a well-performing IR system, especially in the biomedical domain.
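To make concrete what "small changes in the tokenization method" can mean, the following is a minimal sketch in Python. The abstract does not say which techniques the paper compares, so the three tokenizers here (`tokenize_whitespace`, `tokenize_alnum`, `tokenize_alnum_split_digits`) and the sample query are hypothetical illustrations, not the authors' method.

```python
import re


def tokenize_whitespace(text: str) -> list[str]:
    """Lowercase and split on whitespace only; punctuation stays attached."""
    return text.lower().split()


def tokenize_alnum(text: str) -> list[str]:
    """Lowercase and treat every non-alphanumeric character as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_alnum_split_digits(text: str) -> list[str]:
    """Like tokenize_alnum, but also break at letter/digit boundaries."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower())


# Hypothetical biomedical-style query (an assumption, not taken from the paper).
query = "MIP-1-alpha binds CCR5"

for tokenizer in (tokenize_whitespace, tokenize_alnum, tokenize_alnum_split_digits):
    print(f"{tokenizer.__name__}: {tokenizer(query)}")
```

Under these assumptions the same string yields three, five, or six index terms. A document and a query only match when both produce the same terms, which is one plausible reason why names mixing hyphens, letters, and digits would react more strongly to the choice of tokenizer than ordinary news vocabulary.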
Original language | Undefined |
---|---|
Title of host publication | Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval |
Place of Publication | New York, NY, USA |
Publisher | ACM Press |
Pages | 803-804 |
Number of pages | 2 |
ISBN (Print) | 978-1-59593-597-7 |
Publication status | Published - 2007 |
Event | 30th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007 - Amsterdam, Netherlands; Duration: 23 Jul 2007 → 27 Jul 2007; Conference number: 30 |
Publication series
Name | |
---|---|
Publisher | ACM Press |
Number | LNCS4549 |
Conference
Conference | 30th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007 |
---|---|
Abbreviated title | SIGIR |
Country/Territory | Netherlands |
City | Amsterdam |
Period | 23/07/07 → 27/07/07 |
Keywords
- HMI-IE: Information Engineering
- HMI-MR: Multimedia Retrieval
- METIS-241899
- IR-61906
- EWI-11033
- HMI-SLT: Speech and Language Technology