Improved l-diversity: Scalable anonymization approach for Privacy Preserving Big Data Publishing

Abstract In the era of big data analytics, data owners are increasingly concerned about data privacy. Data anonymization approaches such as k-anonymity, l-diversity, and t-closeness have long been used to preserve privacy in published data. However, these approaches are not directly applicable to large volumes of data. Distributed programming frameworks such as MapReduce and Spark are used for big data analytics, which adds further challenges to privacy-preserving data publishing. We identified only a few scalable approaches for Privacy Preserving Big Data Publishing in the literature, and the majority of them are based on k-anonymity and l-diversity. However, these approaches require significant improvement to reach the level of existing privacy-preserving data publishing approaches. In this paper, we therefore propose the Improved Scalable l-Diversity (ImSLD) approach, an extension of Improved Scalable k-Anonymity (ImSKA), for scalable anonymization. Our approaches are based on scalable k-anonymization that uses MapReduce as the programming paradigm. We test our approaches on the poker dataset and on synthesized big data versions of it. The result analysis shows a significant improvement in running time, owing to fewer MapReduce iterations, and lower information loss than existing approaches at the same level of privacy, owing to the tight arrangement of records in the initial equivalence classes.
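To make the two privacy models named in the abstract concrete, the following is a minimal single-machine sketch of how k-anonymity and distinct l-diversity can be checked over equivalence classes. It is not the ImSLD or ImSKA algorithm and does not use MapReduce; the function names, the attribute names (`age`, `zip`, `disease`), and the toy records are all hypothetical and chosen only for illustration. Distinct l-diversity is assumed here as the simplest variant of the l-diversity definition.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in classes.values()
    )


# Hypothetical, already-generalized toy table (not the poker dataset).
records = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "asthma"},
]
quasi_identifiers = ["age", "zip"]

k_ok = is_k_anonymous(records, quasi_identifiers, k=2)
l_ok = is_distinct_l_diverse(records, quasi_identifiers, "disease", l=2)
```

In this toy table every equivalence class has two records, so the 2-anonymity check passes, but the first class contains a single sensitive value, so the 2-diversity check fails. This illustrates why l-diversity is a stricter requirement than k-anonymity alone.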
