Duplicate Detection in Probabilistic Data

Fabian Panse, Maurice van Keulen, Ander de Keijzer, Norbert Ritter

Research output: Book/Report › Report › Professional

Abstract

Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, for increasing the efficiency of the duplicate detection process we introduce search space reduction methods adapted to probabilistic data.
Original language: Undefined
Place of publication: Enschede
Publisher: Databases (DB)
Number of pages: 8
Publication status: Published - Dec 2009

Publication series

Name: CTIT Technical Report Series
Publisher: Centre for Telematics and Information Technology, University of Twente
No.: TR-CTIT-09-44
ISSN (Print): 1381-3625

Keywords

  • DB-SDI: SCHEMA AND DATA INTEGRATION
  • METIS-265252
  • EWI-17086
  • IR-69305

Cite this

Panse, F., van Keulen, M., de Keijzer, A., & Ritter, N. (2009). Duplicate Detection in Probabilistic Data. (CTIT Technical Report Series; No. TR-CTIT-09-44). Enschede: Databases (DB).
@book{806787fc66ce441b93348d1bf26b1ba4,
title = "Duplicate Detection in Probabilistic Data",
abstract = "Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, for increasing the efficiency of the duplicate detection process we introduce search space reduction methods adapted to probabilistic data.",
keywords = "DB-SDI: SCHEMA AND DATA INTEGRATION, METIS-265252, EWI-17086, IR-69305",
author = "Fabian Panse and {van Keulen}, Maurice and {de Keijzer}, Ander and Norbert Ritter",
note = "Extended version of NTII2010 workshop paper.",
year = "2009",
month = "12",
language = "Undefined",
series = "CTIT Technical Report Series",
publisher = "Databases (DB)",
number = "TR-CTIT-09-44",
}

TY - BOOK
T1 - Duplicate Detection in Probabilistic Data
AU - Panse, Fabian
AU - van Keulen, Maurice
AU - de Keijzer, Ander
AU - Ritter, Norbert
N1 - Extended version of NTII2010 workshop paper.
PY - 2009/12
Y1 - 2009/12
AB - Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, for increasing the efficiency of the duplicate detection process we introduce search space reduction methods adapted to probabilistic data.
KW - DB-SDI: SCHEMA AND DATA INTEGRATION
KW - METIS-265252
KW - EWI-17086
KW - IR-69305
M3 - Report
T3 - CTIT Technical Report Series
BT - Duplicate Detection in Probabilistic Data
PB - Databases (DB)
CY - Enschede
ER -
