Indeterministic Handling of Uncertain Decisions in Deduplication

Fabian Panse, Norbert Ritter, Maurice van Keulen

Research output: Contribution to journal › Article

  • 3 Citations

Abstract

In current research and practice, deduplication is usually treated as a deterministic process in which tuples are either declared to be duplicates or not. In ambiguous situations, however, it is often not clear-cut which tuples represent the same real-world entity. Deterministic approaches may ignore many realistic possibilities, which in turn can lead to false decisions. In this paper, we present an indeterministic approach to deduplication that uses a probabilistic target model, including techniques for the proper probabilistic interpretation of similarity matching results. Thus, instead of deciding on a single most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic, and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.
Language: English
Pages: 9
Number of pages: 44
Journal: ACM journal of data and information quality
Volume: 4
Issue number: 2
DOIs: 10.1145/2435221.2435225
State: Published - Mar 2013

Keywords

  • CR-H.2.8
  • Probabilistic Data
  • EWI-21610
  • IR-79936
  • Uncertainty
  • METIS-296040
  • Deduplication

Cite this

@article{0dc0a7fcb9594bf8ba324eabcea6a8f5,
title = "Indeterministic Handling of Uncertain Decisions in Deduplication",
keywords = "CR-H.2.8, Probabilistic Data, EWI-21610, IR-79936, Uncertainty, METIS-296040, Deduplication",
author = "Fabian Panse and Norbert Ritter and {van Keulen}, Maurice",
year = "2013",
month = "3",
doi = "10.1145/2435221.2435225",
language = "English",
volume = "4",
pages = "9",
journal = "ACM journal of data and information quality",
issn = "1936-1955",
publisher = "Association for Computing Machinery",
number = "2",

}

Indeterministic Handling of Uncertain Decisions in Deduplication. / Panse, Fabian; Ritter, Norbert; van Keulen, Maurice.

In: ACM journal of data and information quality, Vol. 4, No. 2, 03.2013, p. 9.
