Source Selection for Inconsistency Detection

Inconsistencies in a database can be detected through violations of integrity constraints such as functional dependencies (FDs). In the big data era, the availability of many related data sources makes it possible to detect inconsistencies more extensively: even when no violation exists within a single data set D, other data sources can be leveraged to discover potential violations. A significant challenge for source-based violation detection is that accessing too many data sources incurs a huge cost, while involving too few may miss serious violations. Motivated by this, we investigate how to select a proper subset of sources for inconsistency detection. We formulate a gain model for sources and introduce the resulting optimization problem of source selection, called SSID, in which the gain is maximized subject to the cost staying under a threshold. We show that SSID is NP-hard and propose a greedy approximation approach for it. To avoid accessing data sources, we also present a randomized technique for gain estimation with theoretical guarantees. Experimental results on both real and synthetic data show that our algorithm is both effective and efficient.
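
To make the setting concrete, the following is a minimal sketch of cross-source FD violation detection combined with a budgeted greedy selection. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not define the gain model, so the gain here is assumed to be the number of tuples of D that conflict with at least one selected source under a single FD lhs -> rhs, and the greedy rule is assumed to pick the best marginal gain per unit cost. The names `violating_rows` and `greedy_select`, the sample data, and the costs are all hypothetical. Unlike the randomized estimation mentioned above, this sketch reads every source in full to compute its gain.

```python
from typing import Dict, List, Set, Tuple

# A row is reduced to (lhs value, rhs value) for one FD lhs -> rhs (assumption).
Row = Tuple[str, str]


def violating_rows(d: List[Row], source: List[Row]) -> Set[int]:
    """Indices of rows in d that agree with a source row on lhs but differ on rhs."""
    rhs_by_lhs: Dict[str, Set[str]] = {}
    for lhs, rhs in source:
        rhs_by_lhs.setdefault(lhs, set()).add(rhs)
    return {
        i for i, (lhs, rhs) in enumerate(d)
        if rhs_by_lhs.get(lhs, set()) - {rhs}  # some source rhs differs from ours
    }


def greedy_select(
    d: List[Row],
    sources: Dict[str, List[Row]],
    costs: Dict[str, float],  # assumed strictly positive
    budget: float,
) -> Tuple[List[str], Set[int]]:
    """Repeatedly add the affordable source with the best marginal gain per cost."""
    covered_by = {name: violating_rows(d, rows) for name, rows in sources.items()}
    chosen: List[str] = []
    covered: Set[int] = set()
    spent = 0.0
    remaining = set(sources)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue  # this source no longer fits in the budget
            ratio = len(covered_by[name] - covered) / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds any new violation
            break
        chosen.append(best)
        covered |= covered_by[best]
        spent += costs[best]
        remaining.remove(best)
    return chosen, covered


# Hypothetical data: FD zip -> city; D alone contains no violation.
D = [("10001", "NYC"), ("60601", "Chicago"), ("94105", "SF")]
SOURCES = {
    "A": [("10001", "New York")],
    "B": [("10001", "Boston"), ("60601", "Chicago")],
    "C": [("60601", "Evanston"), ("94105", "Oakland")],
}
COSTS = {"A": 1.0, "B": 2.0, "C": 2.0}

if __name__ == "__main__":
    print(greedy_select(D, SOURCES, COSTS, budget=3.0))
```

With a budget of 3, the sketch selects sources A and C, which together expose conflicts on all three tuples of D; source B is skipped because it only repeats the conflict on the first tuple that A already reveals. A plain ratio-based greedy rule like this one does not by itself carry an approximation guarantee, so it should not be read as the approximation approach the abstract refers to.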
