Reliability measurement without limits

Dennis Reidsma, J. Carletta

Research output: Contribution to journal › Article › Academic › peer-review

42 Citations (Scopus)
30 Downloads (Pure)

Abstract

In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw the wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
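For concreteness, the statistic at issue is typically Cohen's $\kappa$, which corrects raw agreement for chance: $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the proportion of items the two coders label identically and $p_e$ is the agreement expected if each coder labelled at random according to their own marginal label distribution. The short Python sketch below is illustrative only (it is not from the paper; the dialogue-act labels and data are hypothetical) and computes $\kappa$ for two coders from the standard definition:

from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance given each coder's
    marginal label distribution. Undefined when p_e == 1 (both coders
    always using one identical label).
    """
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: for each label, the product of the two coders'
    # marginal probabilities of using it, summed over all labels seen.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n)
              for lab in set(coder_a) | set(coder_b))

    return (p_o - p_e) / (1 - p_e)

# Hypothetical dialogue-act labels for ten utterances, two coders.
a = ["stmt", "q", "stmt", "back", "q", "stmt", "stmt", "back", "q", "stmt"]
b = ["stmt", "q", "stmt", "q",    "q", "stmt", "back", "back", "q", "stmt"]
print(round(cohens_kappa(a, b), 3))  # 0.688

These toy coders agree on 8 of 10 items ($p_o = 0.8$, $p_e = 0.36$), giving $\kappa \approx 0.69$, inside the 0.67-0.8 band the abstract notes is still treated as acceptable by many researchers; whether the two disagreements are random noise or a systematic pattern is exactly what the statistic alone cannot tell you.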
Original language: Undefined
Article number: 10.1162/coli.2008.34.3.319
Pages (from-to): 319-326
Number of pages: 8
Journal: Computational linguistics
Volume: 34
Issue number: 3
DOI: https://doi.org/10.1162/coli.2008.34.3.319
Publication status: Published - Sep 2008

Keywords

  • EC Grant Agreement nr.: FP6/033812
  • EWI-12915
  • IR-64823
  • HMI-SLT: Speech and Language Technology
  • METIS-251025
  • HMI-MI: MULTIMODAL INTERACTIONS

Cite this

Reidsma, D., & Carletta, J. (2008). Reliability measurement without limits. Computational linguistics, 34(3), 319-326. https://doi.org/10.1162/coli.2008.34.3.319
Reidsma, Dennis; Carletta, J. / Reliability measurement without limits. In: Computational linguistics. 2008; Vol. 34, No. 3. pp. 319-326.
@article{141e68f4cdd647c8b00277f699fbe10e,
title = "Reliability measurement without limits",
abstract = "In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with a low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine-learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.",
keywords = "EC Grant Agreement nr.: FP6/033812, EWI-12915, IR-64823, HMI-SLT: Speech and Language Technology, METIS-251025, HMI-MI: MULTIMODAL INTERACTIONS",
author = "Dennis Reidsma and J. Carletta",
note = "10.1162/coli.2008.34.3.319",
year = "2008",
month = "9",
doi = "10.1162/coli.2008.34.3.319",
language = "Undefined",
volume = "34",
pages = "319--326",
journal = "Computational linguistics",
issn = "0891-2017",
publisher = "MIT Press Journals",
number = "302/3",

}

Reidsma, D & Carletta, J 2008, 'Reliability measurement without limits', Computational linguistics, vol. 34, no. 3, pp. 319-326. https://doi.org/10.1162/coli.2008.34.3.319

Reliability measurement without limits. / Reidsma, Dennis; Carletta, J.

In: Computational linguistics, Vol. 34, No. 3, 09.2008, pp. 319-326.

Research output: Contribution to journal › Article › Academic › peer-review

TY - JOUR

T1 - Reliability measurement without limits

AU - Reidsma, Dennis

AU - Carletta, J.

N1 - 10.1162/coli.2008.34.3.319

PY - 2008/9

Y1 - 2008/9

N2 - In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw the wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.

AB - In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw the wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.

KW - EC Grant Agreement nr.: FP6/033812

KW - EWI-12915

KW - IR-64823

KW - HMI-SLT: Speech and Language Technology

KW - METIS-251025

KW - HMI-MI: MULTIMODAL INTERACTIONS

U2 - 10.1162/coli.2008.34.3.319

DO - 10.1162/coli.2008.34.3.319

M3 - Article

VL - 34

SP - 319

EP - 326

JO - Computational linguistics

JF - Computational linguistics

SN - 0891-2017

IS - 3

M1 - 10.1162/coli.2008.34.3.319

ER -