A Comparison of the Effects of K-Anonymity on Machine Learning Algorithms

Machine learning algorithms and privacy-preserving data mining (PPDM) have each been researched, but a gap exists in the literature where the two areas meet: how PPDM affects common machine learning algorithms. This research aims to narrow that gap by investigating how a common PPDM algorithm, K-Anonymity, affects common machine learning and data mining algorithms, namely neural networks, logistic regression, decision trees, and Bayesian classifiers. This applied research reveals practical implications for applying PPDM to data mining and machine learning, and it serves as a critical first step toward understanding both how to apply PPDM to machine learning algorithms and how PPDM affects them. Results indicate that certain machine learning algorithms are better suited than others for use with PPDM techniques.
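As an illustration of the K-Anonymity property named above, the following is a minimal sketch and not the procedure used in this research. The records, the field names (`age`, `zip`, `diagnosis`), the choice of quasi-identifiers, and the 10-year age bins are all hypothetical. A table is k-anonymous when every combination of quasi-identifier values is shared by at least k records; generalizing a value (here, replacing an exact age with a range) is one common way to reach that property, at the cost of coarser features for any downstream learner.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range such as '20-29' (hypothetical bin width)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical toy data, not taken from the study.
records = [
    {"age": 23, "zip": "47906", "diagnosis": "flu"},
    {"age": 27, "zip": "47906", "diagnosis": "asthma"},
    {"age": 34, "zip": "47906", "diagnosis": "flu"},
    {"age": 38, "zip": "47906", "diagnosis": "diabetes"},
]

# With exact ages, every (age, zip) pair is unique, so the table is not 2-anonymous.
raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)

# After generalizing age into 10-year ranges, each (age, zip) group holds two records.
generalized = [generalize_age(r) for r in records]
gen_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)

print(raw_ok, gen_ok)
```

In this toy case the generalized table satisfies 2-anonymity, but a classifier trained on it sees only the age range, which is the kind of information loss whose effect on different algorithms the research compares.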
