Human-in-the-loop Language-agnostic Extraction of Medication Data from Highly Unstructured Electronic Health Records

Frank Ruis, Shreyasi Pathak, Jeroen Geerdink, Johannes H. Hegeman, Christin Seifert, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Electronic health records contain important information written in free-form text. They are often highly unstructured and ungrammatical, and they contain misspellings and abbreviations, which makes it difficult to apply traditional natural language processing techniques. Annotated data is hard to come by due to restricted access, and supervised models often do not generalize well to other datasets. We propose a language-agnostic human-in-the-loop approach for extracting medication names from a large set of highly unstructured electronic health records, reaching almost 97% recall on our test set after the second iteration while maintaining 100% precision. Starting with a bootstrap lexicon, we perform a context-based dictionary expansion curated by a human reviewer. The method can handle ambiguous lexicon entries and efficiently find fuzzy matches without producing false positives. The human review step ensures high precision, which is especially important in healthcare, and is not subject to disagreements with annotations from an external source. The code is available online at https://github.com/FrankRuis/medical_concept_extraction.
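
To make the expansion loop described in the abstract concrete, the following is a minimal sketch of one iteration: collect the contexts in which known lexicon terms occur, propose unknown tokens that appear in the same contexts, and keep only those a human reviewer accepts. It is not the authors' implementation (see the repository linked above for that). The function names `context_of` and `expand_lexicon`, the one-token context window, the whitespace tokenization, and the `review` callback are all assumptions made for illustration, and fuzzy matching and ambiguity handling are left out.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple


def context_of(tokens: List[str], i: int) -> Tuple[str, str]:
    """Return the (left, right) neighbours of token i, padded at the edges."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (left, right)


def expand_lexicon(
    records: Iterable[str],
    lexicon: Set[str],
    review: Callable[[str], bool],
    top_k: int = 10,
) -> Set[str]:
    """One hypothetical iteration of context-based dictionary expansion."""
    # Assumption: lowercased whitespace tokenization stands in for
    # whatever tokenization a real pipeline would use.
    tokenized = [record.lower().split() for record in records]

    # Step 1: collect every context in which a known term occurs.
    known_contexts = set()
    for tokens in tokenized:
        for i, token in enumerate(tokens):
            if token in lexicon:
                known_contexts.add(context_of(tokens, i))

    # Step 2: count unknown tokens that occur in one of those contexts.
    candidates = Counter()
    for tokens in tokenized:
        for i, token in enumerate(tokens):
            if token not in lexicon and context_of(tokens, i) in known_contexts:
                candidates[token] += 1

    # Step 3: the human reviewer decides; only accepted terms are added.
    accepted = {term for term, _ in candidates.most_common(top_k) if review(term)}
    return lexicon | accepted


if __name__ == "__main__":
    notes = ["gets paracetamol 2dd", "gets metoprolol 2dd", "gets physio 2dd"]
    # Stand-in for an interactive reviewer who rejects "physio".
    reviewer = lambda term: term == "metoprolol"
    print(sorted(expand_lexicon(notes, {"paracetamol"}, reviewer)))
```

Because every candidate passes through `review` before entering the lexicon, precision in this sketch depends on the reviewer rather than on the context heuristic, which mirrors the role the abstract assigns to the human review step.
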
Original language: English
Title of host publication: 2020 International Conference on Data Mining Workshops (ICDMW)
Place of Publication: Piscataway, NJ
Publisher: IEEE
Pages: 644-650
Number of pages: 7
ISBN (Electronic): 978-1-7281-9012-9
ISBN (Print): 978-1-7281-9013-6
Publication status: Published - 20 Nov 2020
Event: 2020 International Conference on Data Mining Workshops, ICDMW 2020 - Sorrento, Italy
Duration: 17 Nov 2020 to 20 Nov 2020

Publication series

Name: International Conference on Data Mining Workshops (ICDMW)
Publisher: IEEE
Volume: 2020
ISSN (Print): 2375-9232
ISSN (Electronic): 2375-9259

Conference

Conference: 2020 International Conference on Data Mining Workshops, ICDMW 2020
Abbreviated title: ICDMW
Country/Territory: Italy
City: Sorrento
Period: 17/11/20 to 20/11/20

Keywords

  • Dictionaries
  • Annotations
  • Data models
  • Natural language processing
  • Data mining
  • Electronic medical records
  • Task analysis
