Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints

Martin Toepfer, Christin Seifert

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

    1 Citation (Scopus)
    47 Downloads (Pure)

    Abstract

    Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document level. Therefore, we propose a novel approach that detects the documents, rather than the concepts, for which quality criteria are met. Our approach uses a deep, multi-layered regression architecture, which comprises a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of previously unseen data where considerable gains in document-level recall can be achieved while precision is upheld. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems.
    Original language: English
    Title of host publication: Digital Libraries for Open Knowledge - 22nd International Conference on Theory and Practice of Digital Libraries, TPDL 2018, Proceedings
    Editors: Eva Mendez, Cristina Ribeiro, Gabriel David, João Correia Lopes, Fabio Crestani
    Pages: 3-15
    Number of pages: 13
    Publication status: Published - 2018
    Event: 22nd International Conference on Theory and Practice of Digital Libraries, TPDL 2018 - University of Porto, Faculty of Engineering, Porto, Portugal
    Duration: 10 Sep 2018 to 13 Sep 2018
    Conference number: 22
    http://www.tpdl.eu/tpdl2018/

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 11057 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 22nd International Conference on Theory and Practice of Digital Libraries, TPDL 2018
    Abbreviated title: TPDL
    Country: Portugal
    City: Porto
    Period: 10/09/18 to 13/09/18
