Privacy preserving big data analytics: A critical analysis of state‐of‐the‐art

In the era of “big data,” vast numbers of people, devices, and sensors are connected via digital networks, and the interactions among these entities generate enormous volumes of valuable data that help organizations innovate and grow. However, this data deluge also raises serious privacy concerns, which may provoke a regulatory backlash and hinder further organizational innovation. To address the challenge of information privacy, researchers have explored privacy‐preserving methodologies over the past two decades. Nevertheless, a thorough study of privacy‐preserving big data analytics is missing from the existing literature. The main contributions of this article are a systematic evaluation of various privacy preservation approaches and a critical analysis of state‐of‐the‐art privacy‐preserving big data analytics methodologies. More specifically, we propose a four‐dimensional framework for analyzing and designing the next generation of privacy‐preserving big data analytics approaches. We also pinpoint the potential opportunities and challenges of applying privacy‐preserving big data analytics in business settings, and we provide five recommendations for doing so effectively. To the best of our knowledge, this is the first systematic study of the state‐of‐the‐art in privacy‐preserving big data analytics. The managerial implication of our study is that organizations can apply the results of our critical analysis to strengthen their strategic deployment of big data analytics in business settings, and hence better leverage big data for sustainable organizational innovation and growth.
