Detecting Mislabeled Data Using Supervised Machine Learning Techniques

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    Abstract

    Many data sets, for instance those gathered during user experiments, are contaminated with noise. Some noise in the measured features is not much of a problem; it even increases the performance of many Machine Learning (ML) techniques. Noise in the labels (mislabeled data) is a different matter: label noise degrades the performance of all ML techniques. The research question addressed in this paper is to what extent mislabeled data can be detected using a committee of supervised Machine Learning models. The committee under consideration consists of a Bayesian model, a Random Forest, a Logistic classifier, a Neural Network and a Support Vector Machine. This committee is applied to a given data set in several iterations of 5-fold cross-validation. If a data sample is misclassified by all committee members in all iterations (consensus), it is tagged as mislabeled. This approach was tested on the Iris plant data set, artificially contaminated with mislabeled data. For this data set, the precision of detecting mislabeled samples is 100% and the recall is approximately 5%. The approach was also tested on the Touch data set, a data set of naturalistic social touch gestures. This data set is known to contain mislabeled data, but the amount is unknown. For this data set, the proposed method achieved a precision of 70%, and for almost all other tagged samples the corresponding touch gesture deviated strongly from the prototypical touch gesture. Overall, the proposed method shows high potential for detecting mislabeled samples, but its precision on other data sets needs to be investigated.
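The consensus rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the committee members below (a nearest-centroid rule and two k-nearest-neighbour rules) are simple stand-ins for the five models named in the abstract, the toy data set replaces Iris, and all function and parameter names (`consensus_mislabeled`, `n_iterations`, `seed`, and so on) are hypothetical.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest_centroid(train_X, train_y, x):
    """Stand-in committee member: predict the class with the closest mean."""
    groups = {}
    for features, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(features)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

def make_knn(k):
    """Stand-in committee member: majority vote among the k nearest samples."""
    def knn(train_X, train_y, x):
        order = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
        return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]
    return knn

def consensus_mislabeled(X, y, committee, n_iterations=3, n_folds=5, seed=0):
    """Return indices misclassified by every member in every iteration."""
    rng = random.Random(seed)
    always_wrong = [True] * len(X)
    for _ in range(n_iterations):
        order = list(range(len(X)))
        rng.shuffle(order)  # a fresh random fold assignment per iteration
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_X = [X[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            for i in test_idx:
                if always_wrong[i]:
                    # One correct prediction by any member breaks the consensus.
                    always_wrong[i] = all(
                        member(train_X, train_y, X[i]) != y[i]
                        for member in committee
                    )
    return [i for i, wrong in enumerate(always_wrong) if wrong]

# Toy data (an assumption, not the paper's data): two well-separated 5x5 grids.
X = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
X += [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(5) for j in range(5)]
y = [0] * 25 + [1] * 25
flipped = {3, 30}  # labels contaminated on purpose
for i in flipped:
    y[i] = 1 - y[i]

committee = [nearest_centroid, make_knn(1), make_knn(3)]
flagged = consensus_mislabeled(X, y, committee)

true_positives = len(flipped.intersection(flagged))
precision = true_positives / len(flagged) if flagged else 0.0
recall = true_positives / len(flipped)
print(flagged, precision, recall)
```

Because a single correct prediction clears a sample, the rule is deliberately conservative, which is consistent with the high-precision, low-recall behaviour the abstract reports for Iris. The toy clusters here are trivially separable, so the scores printed by this sketch say nothing about the figures reported in the paper.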
    Original language: English
    Title of host publication: Augmented Cognition. Neurocognition and Machine Learning
    Subtitle of host publication: 11th International Conference, AC 2017, Held as Part of HCI International 2017, Vancouver, BC, Canada, July 9-14, 2017, Proceedings, Part I
    Editors: Dylan D. Schmorrow, Cali M. Fidopiastis
    Publisher: Springer
    Pages: 571-581
    ISBN (Electronic): 978-3-319-58628-1
    ISBN (Print): 978-3-319-58627-4
    DOI: https://doi.org/10.1007/978-3-319-58628-1_43
    Publication status: Published - 2017
    Event: 11th International Conference on Augmented Cognition 2017 - Vancouver Convention Centre, Vancouver, Canada
    Duration: 9 Jul 2017 to 14 Jul 2017
    Conference number: 11
    http://2017.hci.international/ac

    Conference

    Conference: 11th International Conference on Augmented Cognition 2017
    Abbreviated title: HCI
    Country: Canada
    City: Vancouver
    Period: 9/07/17 to 14/07/17

    Fingerprint

    Learning systems
    Labels
    Support vector machines
    Logistics
    Classifiers
    Neural networks
    Experiments

    Keywords

    • Mislabeled data
    • Supervised Machine Learning

    Cite this

    Poel, M. (2017). Detecting Mislabeled Data Using Supervised Machine Learning Techniques. In D. D. Schmorrow, & C. M. Fidopiastis (Eds.), Augmented Cognition. Neurocognition and Machine Learning: 11th International Conference, AC 2017, Held as Part of HCI International 2017, Vancouver, BC, Canada, July 9-14, 2017, Proceedings, Part I (pp. 571-581). Springer. https://doi.org/10.1007/978-3-319-58628-1_43
    @inproceedings{a11b3476b0ba41b48ba84b5c8b4e7138,
    title = "Detecting Mislabeled Data Using Supervised Machine Learning Techniques",
    keywords = "Mislabeled data, Supervised Machine Learning",
    author = "Mannes Poel",
    year = "2017",
    doi = "10.1007/978-3-319-58628-1_43",
    language = "English",
    isbn = "978-3-319-58627-4",
    pages = "571--581",
    editor = "Schmorrow, {Dylan D.} and Fidopiastis, {Cali M.}",
    booktitle = "Augmented Cognition. Neurocognition and Machine Learning",
    publisher = "Springer",

    }


