Annotations and Subjective Machines: Of Annotators, Embodied Agents, Users, and Other Humans

Abstract

Researchers who use multimodal annotated corpora face a dilemma. On the one hand, one would prefer research results that are reproducible and independent of the particular annotators who produced the corpus used to obtain them. A low level of inter-annotator agreement on an annotation task signals a risk that this requirement is not met, especially if the disagreement stems from annotators making errors in their task. On the other hand, many interesting research questions concern phenomena for which annotation is an inherently subjective task. The judgements required of the annotators then depend heavily on how each annotator personally views and interprets certain communicative behavior. In that case, the research results become less easily reproducible, and they are certainly no longer independent of the particular annotators who produced the corpus.

The usual practice for assessing whether a corpus is fit for the purpose for which it was constructed is to calculate the level of inter-annotator agreement; when it exceeds a fixed threshold, the data is considered to be of tolerable quality. There are two problems with this approach. First, it rests on the assumption that any disagreement in the data is not systematic but looks like noise, an assumption that is not always warranted. Second, the approach is not well suited to annotations that are subjective to some degree, because annotator disagreement is then (partly) an inherent property of the annotation: it expresses something about the level of intersubjectivity between annotators in how they interpret certain communicative behavior versus the amount of idiosyncrasy in their judgements of that behavior.

This thesis addresses both problems. The theoretical part shows that when disagreement is systematic, reaching a certain level of inter-annotator agreement may indeed not guarantee that the data is fit for its purpose. Simulations are used to investigate how systematic disagreement affects the relation between the level of inter-annotator agreement and the validity of machine-learning results obtained on the data. The practical part explores two new methods for working with data that has been annotated at a low level of inter-annotator agreement. One method aims to find a subset of the annotations that has been annotated more reliably, in a way that makes it possible to determine for new, unseen data whether it belongs to this subset, and therefore whether a classifier trained on the more reliable subset is qualified to make a judgement about the new data. The other method uses machine learning to explicitly model the overlap and disjunctions in the judgements of different annotators. Together, the two methods should make it possible to build classifiers that, when deployed in a practical application, yield decisions that make sense to the human end user of the application, who may well have his or her own way of interpreting the communicative behavior presented to the classifier.
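As a rough illustration of the first problem described above, the sketch below simulates two annotators whose disagreement is either noise-like or systematic and computes Cohen's kappa for both conditions. This code is not taken from the thesis; the dialogue-act-style categories, class frequencies, error rate, and the 0.67 acceptance threshold mentioned in the comments are illustrative assumptions only.

import random
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement (Cohen's kappa) for two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1.0 - expected)

def simulate(systematic, n_items=20_000, error_rate=0.15, seed=1):
    """Generate two annotations of the same items under two kinds of disagreement.

    Noise condition: annotator B makes uniformly random mistakes on a fraction
    of the items.  Systematic condition: annotator B consistently relabels one
    category ('backchannel') as another ('other').
    """
    rng = random.Random(seed)
    classes = ["backchannel", "statement", "question", "other"]
    weights = [0.15, 0.28, 0.28, 0.29]  # illustrative class frequencies
    truth = rng.choices(classes, weights=weights, k=n_items)
    ann_a = list(truth)  # annotator A applies the guidelines as intended here
    ann_b = []
    for label in truth:
        if systematic:
            ann_b.append("other" if label == "backchannel" else label)
        elif rng.random() < error_rate:
            ann_b.append(rng.choice([c for c in classes if c != label]))
        else:
            ann_b.append(label)
    return ann_a, ann_b

if __name__ == "__main__":
    for condition in (False, True):
        a, b = simulate(systematic=condition)
        kind = "systematic" if condition else "noise-like"
        print(f"{kind:11s} disagreement: kappa = {cohen_kappa(a, b):.2f}")
    # Both conditions come out at a kappa of roughly 0.8, above the commonly
    # cited 0.67 threshold -- yet in the systematic condition annotator B's
    # data contains no 'backchannel' labels at all, so a classifier trained
    # on it could never learn that category, whatever the agreement score says.

With these settings both conditions pass a conventional agreement threshold, but they have very different consequences for a classifier trained on the data; this mismatch between agreement level and machine-learning validity is the kind of effect the simulations in the thesis investigate.
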
Original language: Undefined
Awarding Institution
  • University of Twente
Supervisors/Advisors
  • Nijholt, Antinus, Supervisor
  • op den Akker, Hendrikus J.A., Advisor
Date of Award: 9 Oct 2008
Place of Publication: Enschede
Print ISBN: 978-90-365-2726-2
DOI: 10.3990/1.9789036527262
State: Published - 9 Oct 2008

Keywords

  • EC Grant Agreement nr.: FP6/506811
  • IR-59870
  • EWI-13612
  • EC Grant Agreement nr.: FP6/033812
  • HMI-MI: MULTIMODAL INTERACTIONS
  • METIS-252056
  • HMI-IE: Information Engineering

Cite this

Reidsma, Dennis (2008). Annotations and Subjective Machines: Of Annotators, Embodied Agents, Users, and Other Humans. PhD thesis, University of Twente, Enschede. 98 p. DOI: 10.3990/1.9789036527262

Research output: PhD Thesis - Research UT, graduation UT
