Reliability measurement without limits

Dennis Reidsma, J. Carletta

    Research output: Contribution to journal › Article › Academic › peer-review



    In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
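The abstract's closing advice, to look for patterns in coder disagreement rather than rely on a single reliability figure, can be illustrated with a minimal sketch. It is not the authors' experiment: the labels `x` and `y`, the two hypothetical second coders, and the helper names `cohen_kappa` and `disagreement_counts` are all assumptions made up for this illustration, and Cohen's kappa stands in for "some statistic such as $\kappa$". The evenly spread disagreement is only a deterministic stand-in for random noise.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two coders' label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each coder's label proportions, summed.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def disagreement_counts(a, b):
    """Count each (label_a, label_b) pair on which the coders differ."""
    return Counter((x, y) for x, y in zip(a, b) if x != y)


# Toy data (hypothetical): 100 items, two labels.
coder_a = ["x"] * 50 + ["y"] * 50
# Second coder who errs in one direction only: 10 items moved from x to y.
systematic_b = ["y"] * 10 + ["x"] * 40 + ["y"] * 50
# Second coder whose 10 disagreements are spread evenly in both directions.
balanced_b = ["y"] * 5 + ["x"] * 45 + ["x"] * 5 + ["y"] * 45

for name, coder_b in [("systematic", systematic_b), ("balanced", balanced_b)]:
    kappa = cohen_kappa(coder_a, coder_b)
    print(name, round(kappa, 3), dict(disagreement_counts(coder_a, coder_b)))
```

On this toy data both comparisons give the same kappa of 0.8, yet the disagreement counts differ: one coder's disagreements all run from `x` to `y`, while the other's are split between the two directions. The summary statistic alone does not distinguish the two cases, which is why the disagreement table is worth inspecting separately.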
    Original language: Undefined
    Article number: 10.1162/coli.2008.34.3.319
    Pages (from-to): 319-326
    Number of pages: 8
    Journal: Computational Linguistics
    Issue number: 302/3
    Publication status: Published - Sep 2008


    • EC Grant Agreement nr.: FP6/033812
    • EWI-12915
    • IR-64823
    • HMI-SLT: Speech and Language Technology
    • METIS-251025
