PACAS: Privacy-Aware, Data Cleaning-as-a-Service

Data cleaning consumes up to 80% of the time spent in the data analysis pipeline. This is a significant overhead for organizations where data cleaning is still a manual process that requires domain expertise. Recent advances have fueled a new computing paradigm called Database-as-a-Service, where data management tasks are outsourced to large service providers. We propose a new Data Cleaning-as-a-Service model that allows a client to interact with a data cleaning provider who hosts curated and sensitive data. We present PACAS, a Privacy-Aware data Cleaning-As-a-Service framework that facilitates communication between the client and the service provider via a data pricing scheme: clients issue queries, and the service provider returns clean answers for a price while protecting its own data. For such interactive settings, we propose a practical privacy model called (X,Y,L)-anonymity, which extends existing data publishing techniques to consider data semantics while protecting sensitive values. Our evaluation over real data shows that PACAS effectively safeguards semantically related sensitive values and provides improved accuracy over existing privacy-aware cleaning techniques.
