MeSH Up: effective MeSH text classification for improved document retrieval

Rudolf Berend Trieschnigg, Piotr Pezik, Vivian Lee, Franciska M.G. de Jong, Wessel Kraaij, Dietrich Rebholz-Schuhmann

  • 73 Citations

Abstract

Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems.

Results: We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language and a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. A KNN system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone.

Conclusions: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable with those observed for manual annotations.
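The KNN idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the abstract does not describe their features, similarity measure, or value of k, so the term-frequency cosine similarity, the similarity-weighted voting, the toy corpus, and the function names (`knn_mesh`, `annotate_query`) below are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_mesh(text, annotated_docs, k=2, top_n=3):
    """Suggest MeSH terms by letting the k most similar annotated
    documents vote, each vote weighted by its similarity (assumed scheme)."""
    query = Counter(tokenize(text))
    neighbours = sorted(
        ((cosine(query, Counter(tokenize(doc))), terms) for doc, terms in annotated_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, terms in neighbours:
        for term in terms:
            votes[term] += similarity
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, score in ranked[:top_n] if score > 0]


def annotate_query(query, annotated_docs, k=2):
    """Pair the original query text with automatically suggested MeSH terms."""
    return {"text": query, "mesh": knn_mesh(query, annotated_docs, k=k)}


# Hypothetical, hand-made corpus of (text, manual MeSH annotations) pairs.
TOY_CORPUS = [
    ("insulin therapy in patients with type 2 diabetes",
     ["Diabetes Mellitus, Type 2", "Insulin"]),
    ("gene expression profiling in breast cancer tumors",
     ["Breast Neoplasms", "Gene Expression Profiling"]),
    ("glucose control and insulin resistance in diabetes",
     ["Insulin Resistance", "Diabetes Mellitus, Type 2"]),
    ("tumor suppressor gene mutations in cancer",
     ["Neoplasms", "Genes, Tumor Suppressor"]),
]

annotated = annotate_query("insulin resistance in diabetes patients", TOY_CORPUS)
print(annotated["mesh"])
```

In this sketch the annotated query keeps the original text alongside the suggested concepts, so a retrieval system could match on both; how the two are actually combined in the paper's IR experiments is not stated in the abstract.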
Original language: Undefined
Article number: 10.1093/bioinformatics/btp249
Pages (from-to): 1412-1418
Number of pages: 7
Journal: Bioinformatics
Volume: 25
Issue number: 11
DOI: 10.1093/bioinformatics/btp249
State: Published - 1 Jun 2009

Fingerprint

Thesauri
Medical Subject Headings
Information retrieval
Controlled Vocabulary
Information Storage and Retrieval
Vector spaces
Ontology
Genes
Space Simulation
Gene Ontology

Keywords

  • HMI-SLT: Speech and Language Technology
  • EWI-15378
  • Genomics Information Retrieval
  • METIS-263854
  • HMI-IE: Information Engineering
  • Text classification
  • EC Grant Agreement nr.: FP6/028099
  • IR-65496

Cite this

Trieschnigg, R. B., Pezik, P., Lee, V., de Jong, F. M. G., Kraaij, W., & Rebholz-Schuhmann, D. (2009). MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics, 25(11), 1412-1418. DOI: 10.1093/bioinformatics/btp249

Trieschnigg, Rudolf Berend; Pezik, Piotr; Lee, Vivian; de Jong, Franciska M.G.; Kraaij, Wessel; Rebholz-Schuhmann, Dietrich / MeSH Up: effective MeSH text classification for improved document retrieval.

In: Bioinformatics, Vol. 25, No. 11, 10.1093/bioinformatics/btp249, 01.06.2009, p. 1412-1418.

Research output: Scientific, peer-reviewed article

@article{a22476cde72a41678597533fc9059e1b,
title = "MeSH Up: effective MeSH text classification for improved document retrieval",
abstract = "Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems. Results: We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language and a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. A KNN system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone. Conclusions: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable with those observed for manual annotations.",
keywords = "HMI-SLT: Speech and Language Technology, EWI-15378, Genomics Information Retrieval, METIS-263854, HMI-IE: Information Engineering, Text classification, EC Grant Agreement nr.: FP6/028099, IR-65496",
author = "Trieschnigg, {Rudolf Berend} and Piotr Pezik and Vivian Lee and {de Jong}, {Franciska M.G.} and Wessel Kraaij and Dietrich Rebholz-Schuhmann",
note = "10.1093/bioinformatics/btp249",
year = "2009",
month = "6",
doi = "10.1093/bioinformatics/btp249",
volume = "25",
pages = "1412--1418",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "11"
}

Trieschnigg, RB, Pezik, P, Lee, V, de Jong, FMG, Kraaij, W & Rebholz-Schuhmann, D 2009, 'MeSH Up: effective MeSH text classification for improved document retrieval', Bioinformatics, vol. 25, no. 11, pp. 1412-1418. DOI: 10.1093/bioinformatics/btp249

TY  - JOUR
T1  - MeSH Up: effective MeSH text classification for improved document retrieval
AU  - Trieschnigg, Rudolf Berend
AU  - Pezik, Piotr
AU  - Lee, Vivian
AU  - de Jong, Franciska M.G.
AU  - Kraaij, Wessel
AU  - Rebholz-Schuhmann, Dietrich
N1  - 10.1093/bioinformatics/btp249
PY  - 2009/6/1
Y1  - 2009/6/1
AB  - Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems. Results: We compare the performance of six MeSH classification systems [MetaMap, EAGL, a language and a vector space model-based approach, a K-Nearest Neighbor (KNN) approach and MTI] in terms of reproducing and complementing manual MeSH annotations. A KNN system clearly outperforms the other published approaches and scales well with large amounts of text using the full MeSH thesaurus. Our measurements demonstrate to what extent manual MeSH annotations can be reproduced and how they can be complemented by automatic annotations. We also show that a statistically significant improvement can be obtained in information retrieval (IR) when the text of a user's query is automatically annotated with MeSH concepts, compared to using the original textual query alone. Conclusions: The annotation of biomedical texts using controlled vocabularies such as MeSH can be automated to improve text-only IR. Furthermore, the automatic MeSH annotation system we propose is highly scalable and it generates improvements in IR comparable with those observed for manual annotations.
KW  - HMI-SLT: Speech and Language Technology
KW  - EWI-15378
KW  - Genomics Information Retrieval
KW  - METIS-263854
KW  - HMI-IE: Information Engineering
KW  - Text classification
KW  - EC Grant Agreement nr.: FP6/028099
KW  - IR-65496
U2  - 10.1093/bioinformatics/btp249
DO  - 10.1093/bioinformatics/btp249
M3  - Article
VL  - 25
SP  - 1412
EP  - 1418
JO  - Bioinformatics
T2  - Bioinformatics
JF  - Bioinformatics
SN  - 1367-4803
IS  - 11
M1  - 10.1093/bioinformatics/btp249
ER  -

Trieschnigg RB, Pezik P, Lee V, de Jong FMG, Kraaij W, Rebholz-Schuhmann D. MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics. 2009 Jun 1;25(11):1412-1418. DOI: 10.1093/bioinformatics/btp249