Privacy Preserving Data Quality Assessment for High-Fidelity Data Sharing

In a data-driven economy that struggles to cope with the volume and diversity of information, data quality assessment has become a necessary precursor to data analytics. Real-world data often contains inconsistencies, conflicts and errors. Such dirty data increases processing costs and degrades the results of analytics. Assessing the quality of a dataset is especially important when a party is considering the acquisition of data held by an untrusted entity. In this scenario, the privacy risks of the stakeholders must also be considered. This paper examines the challenges of privacy-preserving data quality assessment. A two-party scenario is considered, consisting of a client that wishes to test data quality and a server that holds the dataset. Privacy-preserving protocols are presented for testing five important data quality metrics: completeness, consistency, uniqueness, timeliness and validity. For semi-honest parties, the protocols ensure that the client does not discover any information about the data other than the value of the quality metric. The server, in turn, does not discover the parameters of the client's query, the specific attributes being tested, or the computed value of the data quality metric. The proposed protocols employ additively homomorphic encryption in conjunction with condensed data representations such as counting hash tables and histograms, which serve as efficient alternatives to solutions based on private set intersection.
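As a rough illustration of how additively homomorphic encryption can be combined with a condensed data representation, the sketch below runs a completeness test for a single attribute. It is not the protocol from the paper. It assumes a textbook Paillier scheme with toy-sized primes, a server that has summarized its table as one non-null count per attribute, and a row count that is public. All names (`encrypt`, `decrypt`, `server_respond`, the sample attributes and counts) are hypothetical.

```python
import math
import secrets

# Toy Paillier key (assumption: illustrative primes, far too small for real use).
P, Q = 104729, 1299709
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                               # valid because g = N + 1


def encrypt(m):
    """Client-side encryption of an integer m < N under the public key N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + m * N) % N2) * pow(r, N, N2) % N2


def decrypt(c):
    """Client-side decryption with the private values LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N) * MU % N


def server_respond(enc_selector, non_null_counts):
    """Server side: homomorphic dot product of the encrypted selector
    with the plaintext per-attribute counts. The server sees only
    ciphertexts, so it learns neither the attribute nor the result."""
    acc = encrypt(0)  # fresh encryption of 0 re-randomizes the response
    for c, count in zip(enc_selector, non_null_counts):
        acc = acc * pow(c, count, N2) % N2  # adds count * selector bit
    return acc


# Hypothetical server data: 8 rows, non-null counts for (name, email, dob).
TOTAL_ROWS = 8
NON_NULL_COUNTS = [8, 6, 5]

# The client tests "email" (index 1) by sending an encrypted one-hot vector.
enc_selector = [encrypt(bit) for bit in (0, 1, 0)]
response = server_respond(enc_selector, NON_NULL_COUNTS)
completeness = decrypt(response) / TOTAL_ROWS
print(completeness)
```

In this sketch the client learns only the non-null count of the attribute it selected, while the server works entirely on ciphertexts. A production design would need proper key sizes and a vetted cryptographic library, and would have to handle the other metrics separately.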
