In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with a low reliability as long as any disagreement among human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine-learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
- EC Grant Agreement nr.: FP6/033812
- HMI-SLT: Speech and Language Technology
- HMI-MI: MULTIMODAL INTERACTIONS