Systematic Literature Review on the Anonymization of High Dimensional Streaming Datasets for Health Data Sharing

One of the biggest challenges to health data sharing is regulation that prohibits the transmission and distribution of Personal Health Information (PHI), even among collaborating organizations. This impedes research and reduces the utility of these datasets. Anonymization can address this issue by hiding PHI while maintaining the analytical utility of the data. Much research has focused on data that is static, independent and complete. Unfortunately, this is not typical of health data. Rather than sitting in static, independent tables, health data typically resides in relational databases with multiple high-dimensional tables that are transactional and constantly changing. Data recipients usually receive multiple versions of the database over time. This study reviews the literature on anonymization methodologies for large, fast-changing, high-dimensional datasets, especially health data. Relevant papers are analyzed, categorized and compared in terms of scope and contributions. Finally, we use the details extracted from our analysis to outline possible research directions for developing a realistic anonymization framework for health data sharing.
