Researchers who make use of multimodal annotated corpora are always presented with something of a dilemma. On the one hand, one would prefer to have research results that are reproducible and independent of the particular annotators who produced the corpus that was used to obtain the results. A low level of inter-annotator agreement achieved on an annotation task implies a risk that this requirement is not met, especially if any disagreement between annotators was caused by annotators making errors in their task. On the other hand, many very interesting research issues concern phenomena for which annotation is an inherently subjective task. The judgements required of the annotators are then heavily dependent on the personal way in which each annotator views and interprets certain communicative behavior. In that case, the research results may become less easily reproducible, and are certainly no longer independent of the particular annotators who produced the corpus.

The usual practice in assessing whether a corpus is fit for the purpose for which it was constructed is to calculate the level of inter-annotator agreement; when it exceeds a certain fixed threshold, the data is considered to be of tolerable quality. There are two problems with this approach. Firstly, it depends on the assumption that any disagreement in the data is not systematic, but looks like noise. This assumption may not always be warranted. Secondly, the approach is not well suited for annotations that are subjective to a certain degree, as in that case annotator disagreement is (partly) an inherent property of the annotation, expressing something about the level of intersubjectivity between annotators in how they interpret certain communicative behavior versus the amount of idiosyncrasy in their judgements with respect to this behavior. This thesis addresses both problems.
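The "fixed threshold" practice mentioned above is typically applied to a chance-corrected agreement coefficient. As an illustration (not taken from the thesis itself), the sketch below computes Cohen's kappa for two annotators' label sequences; the labels and data are hypothetical.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(round(cohens_kappa(a, b), 2))  # prints 0.5
```

A corpus would then be accepted when kappa exceeds some conventional cutoff (values such as 0.67 or 0.8 are often cited); the abstract's point is that passing such a cutoff does not by itself guarantee the data is fit for purpose.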
In the theoretical part, it is shown that when disagreement is systematic, obtaining a certain level of inter-annotator agreement may indeed not be enough of a guarantee that the data is fit for its purpose. Simulations are used to investigate the effect of systematic disagreement on the relation between the level of inter-annotator agreement and the validity of machine-learning results obtained on the data.

In the practical part, two new methods are explored for working with data that has been annotated with a low level of inter-annotator agreement. One method is aimed at finding a subset of the annotations that has been annotated more reliably, in a way that makes it possible to determine for new, unseen data whether it should belong to this subset — and therefore, whether a classifier trained on this more reliable subset is qualified to make a judgement for the new data. The other method is designed to use machine learning for explicitly modeling the overlap and disjunctions in the judgements of different annotators. Together, both methods should make it possible to build classifiers that, when deployed in a practical application, yield decisions that make sense for the human end user of the application, who indeed also may have his or her own way of interpreting the communicative behavior that is subjected to the classifier.
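To make the theoretical point concrete, here is a toy illustration (not the thesis's actual simulations): two annotation processes with roughly the same overall disagreement rate — one purely random, one systematic (errors concentrated in a single class) — can yield very similar kappa scores, even though only the systematic case injects a class bias that a trained classifier would inherit. All parameters below are arbitrary choices for the sketch.

```python
import random

random.seed(0)

def kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

n = 10000
gold = [random.choice(["pos", "neg"]) for _ in range(n)]

# Noise-like disagreement: flip 10% of items, regardless of class.
noisy = [("neg" if g == "pos" else "pos") if random.random() < 0.1 else g
         for g in gold]

# Systematic disagreement: flip ~20% of "pos" items only, which gives
# roughly the same overall disagreement rate but skews the label distribution.
systematic = [("neg" if random.random() < 0.2 else "pos") if g == "pos" else g
              for g in gold]

print(f"kappa (noise-like):  {kappa(gold, noisy):.2f}")
print(f"kappa (systematic):  {kappa(gold, systematic):.2f}")
```

Both coefficients come out around 0.8, yet a classifier trained on the systematically disagreeing annotations would systematically under-predict the "pos" class — the agreement level alone does not reveal the difference.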
- Award date: 9 Oct 2008
- Place of Publication: Enschede
- Publication status: Published - 9 Oct 2008
- EC Grant Agreement nr.: FP6/506811
- EC Grant Agreement nr.: FP6/033812
- HMI-MI: MULTIMODAL INTERACTIONS
- HMI-IE: Information Engineering