Autoencoder-based cleaning in probabilistic databases

R.R. Mauritz, F.P.J. Nijweide, Jasper Goseling, Maurice van Keulen

Research output: Working paper


Abstract

In the field of data integration, data quality problems are often encountered when extracting, combining, and merging data. The probabilistic data integration approach represents information about such problems as uncertainties in a probabilistic database. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data to identify and correct doubtful values. A theoretical framework is provided, and experiments show that it can remove significant amounts of noise from categorical and numeric probabilistic data. Our method does not require clean data. We do, however, show that manually cleaning a small fraction of the data significantly improves performance.
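As a rough illustration of the idea, the sketch below trains a small bottleneck autoencoder (plain NumPy, not the paper's implementation) on a synthetic numeric table in which one column depends on another. Noise is injected into a fraction of the cells, and the autoencoder is trained only on the noisy data, mirroring the "no clean data required" setting; its reconstruction then serves as the cleaned table. All names, dimensions, and hyperparameters here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic numeric table with one dependency: column 1 = 2 * column 0.
# (Illustrative data, not from the paper.)
n = 500
x0 = rng.uniform(0.0, 1.0, size=(n, 1))
X_clean = np.hstack([x0, 2.0 * x0])

# Inject noise into ~10% of the cells to mimic data-quality problems.
X_noisy = X_clean.copy()
mask = rng.random(X_noisy.shape) < 0.10
X_noisy[mask] += rng.normal(0.0, 0.5, size=mask.sum())

# Centre the noisy data; no clean data is used anywhere in training.
mu = X_noisy.mean(axis=0)
Xc = X_noisy - mu

# One-hidden-unit bottleneck autoencoder, trained by full-batch gradient descent.
d, h, lr = 2, 1, 0.2
W1 = rng.normal(0.0, 0.1, (d, h)); b1 = np.zeros(h)
W2 = rng.normal(0.0, 0.1, (h, d)); b2 = np.zeros(d)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

for _ in range(5000):
    H, X_hat = forward(Xc)
    G = 2.0 * (X_hat - Xc) / n           # gradient of mean-squared loss w.r.t. X_hat
    gW2, gb2 = H.T @ G, G.sum(axis=0)
    GH = (G @ W2.T) * (1.0 - H ** 2)     # backprop through tanh
    gW1, gb1 = Xc.T @ GH, GH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# "Cleaned" table: the autoencoder's reconstruction of the noisy input.
_, rec = forward(Xc)
X_cleaned = rec + mu

err_noisy = np.mean((X_noisy - X_clean) ** 2)      # noise level before cleaning
err_cleaned = np.mean((X_cleaned - X_clean) ** 2)  # residual error after cleaning
print(err_noisy, err_cleaned)
```

Because the bottleneck forces the network onto the low-dimensional structure of the table, noisy cells are pulled back toward the learned dependency, so the cleaned table is closer to the ground truth than the noisy input. The paper's actual method additionally handles categorical attributes and represents corrections as uncertainties in a probabilistic database, which this toy sketch does not attempt.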
Original language: English
Publisher: ArXiv.org
Number of pages: 25
Publication status: Published - 21 Jun 2021

