Computational privacy: towards privacy-conscientious uses of metadata

Doctor of Philosophy

The breadcrumbs left behind by our technologies have the power to fundamentally transform the health and development of societies. Metadata about our whereabouts, social lives, preferences, and finances can be used for good but can also be abused. In this thesis, I show that the richness of today's datasets has rendered traditional data protection strategies outdated, requiring us to deeply rethink our approach. First, I show that the concept of anonymization, central to legal and technical data protection frameworks, does not scale. I introduce the concept of unicity to study the risk of re-identification in large-scale metadata datasets given p points. I then use unicity to show that four spatio-temporal points are enough to uniquely identify 95% of people in a mobile phone dataset and 90% of people in a credit card dataset. In both cases, I also show that traditional de-identification strategies such as data generalization are not sufficient to approach anonymity in modern high-dimensional datasets. Second, I argue that the second pillar of data protection, risk assessment, is similarly crumbling as data gets richer. I show, for instance, how standard mobile phone data (information on how and when somebody calls or texts) can be used to predict personality traits up to 1.7 times better than random. The risk of inference in big data will render comprehensive risk assessments increasingly difficult and, moving forward, potentially irrelevant, as they will require evaluating what can be inferred from rich data both now and in the future. However, this data has great potential for good, especially in developing countries. While it is highly unlikely that we will ever find a magic bullet or even a one-size-fits-all approach to data protection, there are ways to use metadata in privacy-conscientious ways. I finish this thesis by discussing technical solutions (including privacy-through-security ones) which, when combined with legal and regulatory frameworks, provide a reasonable balance between the imperative of using this data and the legitimate concerns of the individual and society.

Thesis Supervisor: Prof. Alex "Sandy" Pentland
Title: Toshiba Professor of Media Arts and Sciences
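
As an illustration of the unicity measure mentioned in the abstract, the following is a minimal sketch and not the thesis's own implementation. It assumes each person's trace is a set of (location, time) points, and it estimates unicity as the fraction of people whose trace is the only one containing p points drawn at random from it. The function name `unicity`, the `traces` layout, and the toy data are all hypothetical.

```python
import random


def unicity(traces, p, seed=0):
    """Estimate unicity: the share of users uniquely identified by p points.

    traces: dict mapping a user id to a set of (location, time) tuples.
    This data layout is an assumption made for the sketch.
    """
    rng = random.Random(seed)
    # Only users with at least p points can be tested.
    eligible = [u for u, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        # Draw p points at random from this user's trace.
        known = set(rng.sample(sorted(traces[user]), p))
        # Count every trace in the dataset that contains all p points.
        matches = sum(1 for pts in traces.values() if known <= pts)
        if matches == 1:
            unique += 1
    return unique / len(eligible)


# Hypothetical toy data: users "a" and "b" share two of their three points.
toy_traces = {
    "a": {("cell1", 9), ("cell2", 12), ("cell3", 18)},
    "b": {("cell1", 9), ("cell2", 12), ("cell4", 18)},
    "c": {("cell5", 9), ("cell6", 12), ("cell7", 18)},
}

print(unicity(toy_traces, p=3))
```

In this toy dataset, knowing all three points of a trace singles out every user, while a single point may not separate "a" from "b". On a real dataset one would typically average the estimate over many random draws rather than rely on a single seed.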
