Indeterministic Handling of Uncertain Decisions in Duplicate Detection

Fabian Panse, Maurice van Keulen, Norbert Ritter

Research output: Book/Report › Report › Professional

Abstract

In current research, duplicate detection is usually treated as a deterministic process in which tuples are either declared duplicates or not. Most often, however, it is not completely clear whether two tuples represent the same real-world entity. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
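
The possible-worlds idea in the abstract can be illustrated with a small sketch. This is not the method from the report: the two thresholds `t_low` and `t_high`, the function names, and the example scores are all hypothetical, and the sketch assumes that a similarity score can be read as an independent match probability. It also ignores the transitivity of duplicate relations. Pairs with a clear score are decided deterministically, and only the remaining pairs are expanded into possible worlds.

```python
from itertools import product


def classify(pairs, t_low=0.3, t_high=0.8):
    """Split scored tuple pairs into matches, non-matches and uncertain pairs.

    The thresholds are hypothetical values chosen for illustration only.
    """
    matches, non_matches, uncertain = [], [], []
    for pair, sim in pairs.items():
        if sim >= t_high:
            matches.append(pair)
        elif sim <= t_low:
            non_matches.append(pair)
        else:
            uncertain.append(pair)
    return matches, non_matches, uncertain


def possible_worlds(pairs, uncertain):
    """Enumerate every combination of decisions for the uncertain pairs.

    Each world maps an uncertain pair to True (duplicate) or False, and
    carries a probability computed under the sketch's assumption that the
    similarity score is an independent match probability.
    """
    worlds = []
    for decisions in product([True, False], repeat=len(uncertain)):
        prob = 1.0
        for pair, is_dup in zip(uncertain, decisions):
            p = pairs[pair]
            prob *= p if is_dup else 1.0 - p
        worlds.append((dict(zip(uncertain, decisions)), prob))
    return worlds


# Made-up similarity scores for four tuple pairs.
scores = {
    ("t1", "t2"): 0.95,
    ("t1", "t3"): 0.5,
    ("t2", "t4"): 0.6,
    ("t3", "t4"): 0.1,
}
matches, non_matches, uncertain = classify(scores)
worlds = possible_worlds(scores, uncertain)
```

With k uncertain pairs the enumeration yields 2^k worlds, which gives a sense of why handling every decision indeterministically is expensive in time and storage, and why narrowing the uncertain band reduces that cost.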
Original language: Undefined
Place of Publication: Enschede
Publisher: Centre for Telematics and Information Technology (CTIT)
Number of pages: 12
Publication status: Published - 4 Jun 2010

Publication series

Name: CTIT Technical Report Series
Publisher: Centre for Telematics and Information Technology, University of Twente
No.: TR-CTIT-10-21
ISSN (Print): 1381-3625

Keywords

  • METIS-270837
  • EWI-17967
  • DB-SDI: SCHEMA AND DATA INTEGRATION
  • IR-71703

Cite this

Panse, F., van Keulen, M., & Ritter, N. (2010). Indeterministic Handling of Uncertain Decisions in Duplicate Detection. (CTIT Technical Report Series; No. TR-CTIT-10-21). Enschede: Centre for Telematics and Information Technology (CTIT).