Abstract
In computational linguistics, a reliability measurement of 0.8 on some statistic such as $\kappa$ is widely thought to guarantee that hand-coded data is fit for purpose, with lower values suspect. We demonstrate that the main use of such data, machine learning, can tolerate data with low reliability as long as any disagreement among the human coders looks like random noise. When it does not, however, data can have a reliability of more than 0.8 and still be unsuitable for use: the disagreement may indicate erroneous patterns that machine learning can learn, and evaluation against test data that contain these same erroneous patterns may lead us to draw the wrong conclusions about our machine-learning algorithms. Furthermore, lower reliability values still held as acceptable by many researchers, between 0.67 and 0.8, may even yield inflated performance figures in some circumstances. Although this is a common-sense result, it has implications for how we work that are likely to reach beyond the machine-learning applications we discuss. At the very least, computational linguists should look for any patterns in the disagreement among coders and assess what impact they will have.
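The $\kappa$ in question is a chance-corrected agreement coefficient. The sketch below is a minimal illustration, not code from the article: a plain-Python implementation of Cohen's $\kappa$ for two coders, plus a toy simulation contrasting random coder noise with patterned, systematic disagreement. The function name `cohens_kappa`, the 15% flip rate, and the every-eighth-item pattern are all hypothetical choices made here for illustration.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders (Cohen's kappa)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, given each coder's label distribution.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)


random.seed(0)
truth = [random.choice("AB") for _ in range(10_000)]
coder1 = list(truth)  # coder 1 reproduces the intended labels

# (a) Disagreement as random noise: coder 2 flips 15% of labels at random.
noisy = [("B" if t == "A" else "A") if random.random() < 0.15 else t
         for t in truth]

# (b) Patterned disagreement: coder 2 always writes "A" for one fixed subset
#     of items (every eighth item here, standing in for a construction the
#     two coders systematically interpret differently).
patterned = ["A" if i % 8 == 0 else t for i, t in enumerate(truth)]

print("kappa with random noise:           %.2f" % cohens_kappa(coder1, noisy))
print("kappa with patterned disagreement: %.2f" % cohens_kappa(coder1, patterned))
```

Under these toy settings the patterned coder comes out with the higher $\kappa$ (roughly 0.88, versus roughly 0.70 for the random one), yet only the patterned disagreement is something a learner could pick up and a similarly coded test set could reward, which is the situation the abstract warns about.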
Original language | English |
---|---|
Pages (from-to) | 319-326 |
Number of pages | 8 |
Journal | Computational Linguistics |
Volume | 34 |
Issue number | 3 |
DOIs | 10.1162/coli.2008.34.3.319 |
Publication status | Published - Sept 2008 |
Keywords
- EC Grant Agreement nr.: FP6/033812
- EWI-12915
- IR-64823
- HMI-SLT: Speech and Language Technology
- METIS-251025
- HMI-MI: MULTIMODAL INTERACTIONS