Probabilistic Data Integration

Research output: Contribution to conference › Poster › Other research output

54 Downloads (Pure)

Abstract

Probabilistic data integration is a specific kind of data integration where integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation.
The approach is based on the view that data quality problems (as they occur in an integration process) can be modeled as uncertainty, and that this uncertainty is itself an important result of the integration process. In a sense, data quality problems arising during data integration are not solved immediately, but are explicitly represented in the resulting integrated data. This data can be stored in a probabilistic database and queried directly, yielding possible or approximate answers. A probabilistic database is a specific kind of DBMS that allows storage, querying, and manipulation of uncertain data; it keeps track of alternatives and of the dependencies among them.
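
To make this representation concrete, the following minimal sketch (in Python; all names and data are invented for illustration and do not come from the poster) shows how mutually exclusive alternatives with probabilities could be stored and queried, yielding possible answers together with the probability that they hold.

    # Minimal sketch of a block-style probabilistic relation: each block holds
    # mutually exclusive alternative tuples for one real-world entity, each
    # alternative carrying a probability. Querying such data yields possible
    # answers with their probabilities. All names here are illustrative.

    person = {
        "p1": [  # two sources disagree on Alice's city; both options are kept
            ({"name": "Alice", "city": "Enschede"}, 0.7),
            ({"name": "Alice", "city": "Hengelo"}, 0.3),
        ],
        "p2": [
            ({"name": "Bob", "city": "Enschede"}, 1.0),
        ],
    }

    def possible_answers(relation, predicate):
        """Probability per answer value (assumes a value occurs in only one block)."""
        result = {}
        for alternatives in relation.values():
            for tup, prob in alternatives:
                if predicate(tup):
                    # alternatives within a block are mutually exclusive,
                    # so their probabilities simply add up
                    result[tup["name"]] = result.get(tup["name"], 0.0) + prob
        return result

    # "Who lives in Enschede?" -> {'Alice': 0.7, 'Bob': 1.0}
    print(possible_answers(person, lambda t: t["city"] == "Enschede"))
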
While traditional data integration methods more or less explicitly treat uncertainty as a problem, as something to be avoided, probabilistic data integration treats uncertainty as an additional source of information, which is precious and should be preserved. It effectively allows the resolution of data integration problems to be postponed. When combined with an effective method for data quality measurement, it also has the potential to enable a pay-as-you-go, good-is-good-enough approach in which small iterations reduce the overall effort of improving the data quality of the integrated result.
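
As a rough illustration of the pay-as-you-go idea (again a hypothetical sketch assuming the block-style relation used above, not a method described in the poster): remaining uncertainty can be resolved incrementally, spending effort on the most uncertain blocks first and stopping once an estimated quality measure is good enough.

    # Hypothetical pay-as-you-go loop over a relation shaped like `person`
    # above: spend resolution effort on the most uncertain block first and
    # stop as soon as an estimated quality measure passes a threshold.

    def uncertainty(alternatives):
        """Crude uncertainty score: 1 minus the probability of the best alternative."""
        return 1.0 - max(p for _, p in alternatives)

    def estimated_quality(relation):
        """Expected fraction of blocks whose most likely alternative is correct."""
        best = [max(p for _, p in alts) for alts in relation.values()]
        return sum(best) / len(best)

    def pay_as_you_go(relation, resolve, good_enough=0.95):
        """Resolve blocks (e.g. by asking a user) until quality passes the bar."""
        while estimated_quality(relation) < good_enough:
            block_id = max(relation, key=lambda b: uncertainty(relation[b]))
            if uncertainty(relation[block_id]) == 0.0:
                break  # nothing left to resolve
            chosen = resolve(block_id, relation[block_id])  # ground-truth tuple
            relation[block_id] = [(chosen, 1.0)]            # uncertainty removed
        return relation

    # e.g. pay_as_you_go(person, lambda b, alts: alts[0][0]) trusts the first
    # listed alternative for whichever block is resolved next.
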
In this presentation, we give an overview of various data integration problems, for example entity resolution and the merging of grouping data, and show how a probabilistic approach can help address them. We furthermore illustrate how probabilistic data integration, as an application, calls for more theoretical research on probabilistic database technology, such as more expressive data models and (approximate) querying formalisms. In particular, we present the problem of incorporating a restricted notion of higher-orderedness in datalog without losing its important properties.
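
For entity resolution in particular, the probabilistic view means that an uncertain match between two records does not have to be decided during integration. The sketch below (hypothetical code with invented records and a toy similarity measure, not taken from the poster) keeps both possible worlds, merged and unmerged, weighted by a match probability.

    # Hypothetical indeterministic entity resolution: when a match between two
    # records is uncertain, keep both possible worlds -- "same entity" (merged
    # record) and "different entities" -- weighted by the match probability,
    # instead of forcing a hard match/non-match decision during integration.

    from difflib import SequenceMatcher

    def match_probability(a, b):
        """Toy match score based on name similarity (stand-in for a real matcher)."""
        return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

    def merge(a, b):
        """Naive merge; conflicting fields could themselves become alternatives."""
        return {**a, **b}

    def resolve_pair(a, b, confident=0.95):
        """Possible worlds with probabilities for one candidate record pair."""
        p = match_probability(a, b)
        if p >= confident:
            return [([merge(a, b)], 1.0)]                  # clearly the same entity
        if p <= 1.0 - confident:
            return [([a, b], 1.0)]                         # clearly different entities
        return [([merge(a, b)], p), ([a, b], 1.0 - p)]     # uncertain: keep both worlds

    r1 = {"name": "J. Smith", "email": "js@example.org"}
    r2 = {"name": "John Smith", "phone": "+31 6 1234 5678"}
    for world, prob in resolve_pair(r1, r2):
        print(f"P={prob:.2f}: {world}")
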
Original language: English
Number of pages: 1
Publication status: Published - 3 Nov 2017

Fingerprint

Data integration
Merging
Data structures
Uncertainty

Cite this

@conference{95c4522859584a03b3581dcea8f8dfd0,
title = "Probabilistic Data Integration",
abstract = "Probabilistic data integration is a specific kind of data integration where integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation.The approach is based on the view that data quality problems (as they occur in an integration process) can be modeled as uncertainty and this uncertainty is considered an important result of the integration process. In a sense, data quality problems arising during the data integration process are not solved immediately, but explicitly represented in the resulting integrated data. This data can be stored in a probabilistic database to be queried directly resulting in possible or approximate answers. A probabilistic database is a specific kind of DBMS that allows storage, querying and manipulation of uncertain data. It keeps track of alternatives and dependencies among them.While traditional data integration methods more or less explicitly consider uncertainty as a problem, as something to be avoided, probabilistic data integration treats uncertainty as an additional source of information, which is precious and should be preserved. It effectively allows for postponement of solving data integration problems. When combined with an effective method for data quality measurement, it also has the potential to allow for a pay- as-you-go and good-is-good-enough approach where small iterations reduce overall effort in improving the data quality of the integrated result.In this presentation, we give an overview of various data integration problems and how a probabilistic approach can improve them, for example, entity resolution and merging of grouping data. We furthermore illustrate how probabilistic data integration as an application asks for more theoretical research on probabilistic database technology, such as more expressive data models and (ap- proximate) querying formalisms. In particular, we present the problem of incorporation of a restricted notion of higher orderedness in datalog without loosing its important properties",
author = "{van Keulen}, Maurice",
year = "2017",
month = "11",
day = "3",
language = "English",

}
