Use Privacy in Data-Driven Systems: Theory and Experiments with Machine Learnt Programs

This paper presents an approach to formalizing and enforcing a class of use privacy properties in data-driven systems. In contrast to prior work, we focus on use restrictions on proxies (i.e., strong predictors) of protected information types. Our definition relates proxy use to intermediate computations that occur in a program and identifies two essential properties that characterize this behavior: 1) the computation's result is strongly associated with the protected information type in question, and 2) it is likely to causally affect the final output of the program. For a specific instantiation of this definition, we present a program analysis technique that detects instances of proxy use in a model and provides a witness identifying which parts of the corresponding program exhibit the behavior. Recognizing that not all instances of proxy use of a protected information type are inappropriate, we employ a normative judgment oracle that makes this inappropriateness determination for a given witness. Our repair algorithm uses the witness of an inappropriate proxy use to transform the model into one that provably does not exhibit proxy use, while avoiding changes that unduly affect classification accuracy. Using a corpus of social datasets, our evaluation shows that these algorithms can detect proxy use instances that would be difficult to find with existing techniques, and can subsequently remove them while maintaining acceptable classification performance.
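The two defining conditions of proxy use can be illustrated with a minimal sketch. This is not the paper's actual analysis or its quantitative measures; the toy model, the agreement-based association score, and the pairwise swap test for influence are all illustrative assumptions. In the example, an intermediate computation `u` (ZIP-code parity) perfectly tracks a protected attribute `z` and fully drives the decision, so it registers as both associated and influential:

```python
import itertools

def association(pairs):
    """Fraction of samples where the intermediate value agrees with the
    protected attribute (a crude stand-in for an association measure)."""
    return sum(u_val == z_val for u_val, z_val in pairs) / len(pairs)

def influence(model, xs, u):
    """Interventional test: fraction of input pairs (x1, x2) for which
    substituting u(x2) for u(x1) changes the model's decision on x1."""
    flips = sum(model(x1, u(x2)) != model(x1, u(x1))
                for x1, x2 in itertools.product(xs, repeat=2))
    return flips / len(xs) ** 2

# Toy population: x = (income, zip_code); the protected attribute z
# happens to equal the ZIP code's parity (hypothetical correlation).
xs = [(income, zip_code) for income in range(4) for zip_code in range(4)]
z = lambda x: x[1] % 2            # protected attribute
u = lambda x: x[1] % 2            # intermediate computation: a proxy for z
model = lambda x, u_val: u_val    # final decision driven entirely by u

assoc = association([(u(x), z(x)) for x in xs])  # 1.0: perfect agreement
infl = influence(model, xs, u)                   # 0.5: half of swaps flip
```

A subcomputation flagged by both tests is a candidate proxy; whether its use is inappropriate is a separate, normative question, which is why the paper delegates that determination to an oracle.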
