Detecting Inconsistencies in Private Data with Secure Function Evaluation

Erroneous and inconsistent data, often referred to as ‘dirty data’, is a major concern for businesses. Prevalent techniques for improving data quality consist of discovering data quality rules, identifying records that violate those rules, and then modifying the data to remove those violations. Most of the work described in the literature deals with cases where both the data and the rules are visible to the party in charge of cleaning the data. However, consider the case where two parties with data and data quality rules wish to cooperate in data cleaning under two restrictions: (1) neither party is willing to share its data because of its sensitive nature, and (2) the data quality rules may reveal information about the content of the data and may be considered a private asset of the business. The question then is how to clean the data without sharing either the data or the rules. While the data cleaning process involves several phases, our focus in this paper is on detecting inconsistent data. We propose a novel inconsistency detection protocol that preserves the privacy of both the data and the data quality rules without the use of a third party. Inconsistent data is defined as all records in a database that violate at least one conditional functional dependency (CFD). Our approach is based primarily on the secure multiparty computation framework. We present a complexity analysis of our protocol and a series of experiments evaluating its performance.
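
To make the notion of inconsistency concrete, the following is a minimal plaintext sketch of checking one CFD against a small table. It is not the privacy-preserving protocol proposed here: a single party sees both the data and the rule, which is exactly the setting the protocol is designed to avoid. The function names, the `_` wildcard convention, and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed notation for "any value" in a pattern tuple


def matches(record, pattern):
    """True if the record agrees with every constant in the pattern."""
    return all(v == WILDCARD or record[a] == v for a, v in pattern.items())


def cfd_violations(records, lhs, rhs, pattern):
    """Return sorted indices of records violating the CFD (lhs -> rhs, pattern).

    A record is flagged if it matches the left-hand-side pattern and either
    (a) the pattern fixes a constant for rhs that the record does not have, or
    (b) another matching record agrees on lhs but differs on rhs.
    """
    lhs_pattern = {a: pattern[a] for a in lhs}
    violating = set()
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        if not matches(rec, lhs_pattern):
            continue  # the rule does not apply to this record
        if pattern[rhs] != WILDCARD and rec[rhs] != pattern[rhs]:
            violating.add(i)  # case (a): constant mismatch
        groups[tuple(rec[a] for a in lhs)].append(i)
    for idxs in groups.values():
        if len({records[i][rhs] for i in idxs}) > 1:
            violating.update(idxs)  # case (b): same lhs, different rhs
    return sorted(violating)


# Hypothetical data: within the UK, a zip code should determine the street.
records = [
    {"country": "UK", "zip": "EH8", "street": "Mayfield"},
    {"country": "UK", "zip": "EH8", "street": "Crichton"},
    {"country": "US", "zip": "EH8", "street": "Main"},
    {"country": "UK", "zip": "W1", "street": "Oxford"},
]
rule = {"country": "UK", "zip": WILDCARD, "street": WILDCARD}
print(cfd_violations(records, ["country", "zip"], "street", rule))  # [0, 1]
```

The first two records share a UK zip code but disagree on the street, so both are reported; the US record is outside the scope of the pattern and is ignored.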
