Leveraging Influence Functions for Dataset Exploration and Cleaning

Agustin Martin Picard, David Vigouroux, Petr Zamolodtchikov, Quentin Vincenot, Jean-Michel Loubes, Edouard Pauwels

Research output: Contribution to conference › Paper › peer-review


Abstract

In this paper, we tackle the problem of finding potentially problematic samples and complex regions of the input space in large pools of data without any supervision, so that they can be relayed to and validated by a domain expert. This information can be critical, as even a low level of noise in the dataset may severely bias the model through spurious correlations between unrelated samples, and under-represented groups of data points exacerbate this issue. We therefore present two practical applications of influence functions in neural network models to industrial use-cases: dataset exploration and cleanup of mislabeled examples. This robust-statistics tool approximates how an estimator would change if the training dataset were slightly perturbed. In particular, we apply the technique to an ACAS Xu neural network surrogate model use-case [14] for complex region exploration, and to the CIFAR-10 canonical RGB image classification problem [20] for mislabeled sample detection, with promising results.
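
As a rough, self-contained illustration of the kind of influence computation the abstract refers to (not the authors' implementation, and using a simple L2-regularised logistic-regression model rather than a neural network), the sketch below scores each training point by its self-influence, grad_i^T H^{-1} grad_i. Points with unusually large scores are ones the model can only fit by moving its parameters a lot, making them candidates for mislabeled or under-represented data to hand to a domain expert. The function names (fit_logreg, self_influence) and the synthetic data are illustrative assumptions.

    # Minimal sketch (illustrative, not the paper's code): self-influence
    # scores for an L2-regularised logistic-regression model.
    import numpy as np

    def fit_logreg(X, y, lam=1e-2, lr=0.1, steps=2000):
        """Plain gradient-descent fit of w for sigmoid(X @ w)."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ w))
            w -= lr * (X.T @ (p - y) / n + lam * w)
        return w

    def self_influence(X, y, w, lam=1e-2):
        """Score each training point by grad_i^T H^{-1} grad_i.

        Large scores flag points that the fitted model struggles to
        accommodate -- candidates for label noise or rare regions of
        the input space.
        """
        n, d = X.shape
        p = 1.0 / (1.0 + np.exp(-X @ w))
        # Per-sample loss gradients: (p_i - y_i) * x_i
        grads = (p - y)[:, None] * X
        # Regularised Hessian of the mean loss
        H = (X * (p * (1 - p))[:, None]).T @ X / n + lam * np.eye(d)
        H_inv_grads = np.linalg.solve(H, grads.T).T
        return np.einsum("ij,ij->i", grads, H_inv_grads)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 5))
        y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(float)
        y[:10] = 1 - y[:10]  # inject label noise into the first 10 points
        w = fit_logreg(X, y)
        scores = self_influence(X, y, w)
        print("most suspicious indices:", np.argsort(scores)[-10:])

In this toy setting, the deliberately flipped labels tend to receive the highest self-influence scores; the paper applies the analogous idea to neural network models on the ACAS Xu surrogate and CIFAR-10 use-cases.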
Original language: English
Number of pages: 8
Publication status: Published - 2022
Event: 11th European Congress Embedded Real Time Systems, ERTS 2022 - Toulouse, France
Duration: 1 Jun 2022 – 2 Jun 2022
Conference number: 11

Conference

Conference: 11th European Congress Embedded Real Time Systems, ERTS 2022
Abbreviated title: ERTS 2022
Country/Territory: France
City: Toulouse
Period: 1/06/22 – 2/06/22
