Protecting Anonymity in Data-Driven Biomedical Science

With formidable recent improvements in data processing and information retrieval, empirical approaches such as knowledge discovery and data mining, business intelligence, and content analytics have enormous potential, particularly for the data-intensive biomedical sciences. For results derived using empirical methods, the underlying data set should be made available, at least to the reviewers during the review process, to ensure the quality of the research, to prevent fraud and errors, and to enable the replication of studies. In medicine and the life sciences in particular, however, this creates a conflict: disclosing research data raises considerable privacy concerns, because researchers bear full responsibility for protecting their (volunteer) subjects and must therefore adhere to the respective ethical policies. One solution to this problem is to protect sensitive information in medical data sets by applying appropriate anonymization. This paper provides an overview of the most important and well-researched approaches and discusses open research problems in this area, with the goal of serving as a starting point for further investigation.
