UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression

Anonymization, a privacy protection technique for data publishing, has been studied extensively over the past twenty years. However, less research has addressed how to make better use of anonymized data for data mining. In this paper, we focus on training a regression model on anonymized data and using the trained model to predict on original samples. Unlike the point-valued instances in most machine learning tasks, anonymized training instances are generally treated as hyper-rectangles. We propose several hyper-rectangle vectorization methods that are compatible with both anonymized and original data for model training. Anonymization also introduces additional uncertainty. To address this issue, we propose an Uncertainty-based Hyper-Rectangle Pruning method (UHRP) to reduce the disturbance introduced by anonymized data. In this method, each hyper-rectangle is pruned according to its global uncertainty, which is calculated from all of its uncertain attributes. Experiments show that, under specific conditions, a linear regressor trained on anonymized data can be expected to perform as well as a model trained on the original data. The experimental results also show that our pruning method can further improve the model's performance.
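
The idea of vectorizing and pruning hyper-rectangles can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the two vectorizations (interval midpoints, and concatenated lower and upper bounds), the uncertainty score (mean interval width normalized by each attribute's domain range), the threshold rule, and all function names are hypothetical rather than the authors' definitions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]   # (low, high) range of one generalized attribute
HyperRect = Sequence[Interval]   # one anonymized record: an interval per attribute


def midpoint_vector(rect: HyperRect) -> List[float]:
    """Assumed vectorization: represent each interval by its midpoint.

    An original (non-anonymized) sample is a degenerate rectangle with
    low == high, so it maps to itself.
    """
    return [(lo + hi) / 2.0 for lo, hi in rect]


def bounds_vector(rect: HyperRect) -> List[float]:
    """Assumed alternative: concatenate the lower and upper bound of each attribute."""
    return [v for lo, hi in rect for v in (lo, hi)]


def global_uncertainty(rect: HyperRect, domains: Sequence[Interval]) -> float:
    """Assumed score: mean interval width, normalized by the attribute's domain range.

    0.0 means an exact point; 1.0 means every attribute spans its full domain.
    """
    widths = [
        (hi - lo) / (d_hi - d_lo)
        for (lo, hi), (d_lo, d_hi) in zip(rect, domains)
    ]
    return sum(widths) / len(widths)


def prune(rects, targets, domains, threshold):
    """Keep only the (rectangle, target) pairs whose score does not exceed the threshold."""
    return [
        (rect, y)
        for rect, y in zip(rects, targets)
        if global_uncertainty(rect, domains) <= threshold
    ]


# Toy data: two attributes with domains [0, 100] and [0, 50].
domains = [(0.0, 100.0), (0.0, 50.0)]
rects = [
    [(20.0, 30.0), (10.0, 20.0)],   # moderately generalized record
    [(0.0, 100.0), (0.0, 50.0)],    # fully generalized record
    [(40.0, 40.0), (25.0, 25.0)],   # original (exact) sample
]
targets = [1.0, 2.0, 3.0]

kept = prune(rects, targets, domains, threshold=0.5)
features = [midpoint_vector(rect) for rect, _ in kept]
```

In this toy example the fully generalized record is dropped, and the remaining midpoint vectors could then be passed to any ordinary linear regressor. The threshold of 0.5 is arbitrary and would need tuning in practice.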
