Reliability measurement without limits

Dennis Reidsma, J. Carletta

49 Citations (Scopus)

Abstract

In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
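For readers unfamiliar with the statistic the abstract refers to, the following is a minimal sketch of Cohen's $\kappa$ for the two-coder case: observed agreement corrected for the agreement expected by chance given each coder's own label distribution. The labels and item counts below are invented for illustration.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if the coders labelled independently,
    # each following their own marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of 10 items with two labels, "x" and "y".
a = ["x", "x", "y", "y", "x", "y", "x", "x", "y", "x"]
b = ["x", "x", "y", "x", "x", "y", "x", "y", "y", "x"]
print(round(cohens_kappa(a, b), 3))  # 0.8 raw agreement, kappa ~0.583
```

Note that the value of $\kappa$ alone says nothing about the *structure* of the two disagreements above, which is precisely the paper's point: the same $\kappa$ can come from random noise or from a systematic coding pattern.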
Original language: Undefined
Pages (from-to): 319-326
Number of pages: 8
Journal: Computational linguistics
Volume: 34
Issue: 3
DOI: https://doi.org/10.1162/coli.2008.34.3.319
Publication status: Published - Sep 2008

Keywords

• EC Grant Agreement nr.: FP6/033812
• EWI-12915
• IR-64823
• HMI-SLT: Speech and Language Technology
• METIS-251025
• HMI-MI: MULTIMODAL INTERACTIONS