Privacy-preserving decision tree for epistasis detection

Epistasis, the interaction between gene loci, is a widespread genetic phenomenon. Detecting the epistasis that underlies complex diseases is a major challenge in genome-wide association studies (GWAS). Although many approaches based on statistics, machine learning, and information entropy have been proposed for epistasis detection, privacy preservation for single nucleotide polymorphism (SNP) data has been largely ignored. This paper therefore proposes a novel two-stage approach. In the first stage, a fusion strategy combines and sorts the SNP importance scores obtained from Relief and mutual information to produce a candidate set of SNPs, which avoids missing SNPs with strong interactions. In the second stage, a differentially private decision tree is applied to search the candidate set for interacting SNPs. Compared with heuristic methods, this achieves efficient epistasis detection for complex diseases while preserving privacy. On simulated data sets, the recognition rate exceeds 90%. In addition, several susceptibility loci, including rs380390 and rs1329428, are found in the real data set for age-related macular degeneration (AMD). These results demonstrate that the method is promising for epistasis detection.
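The first-stage fusion can be illustrated with a minimal sketch. The abstract does not state the exact fusion rule, so the rule used here (min-max normalize each score, then sum) is an assumption, as are the function names `min_max` and `fuse_and_rank` and the placeholder SNP labels. The Relief and mutual information scores are taken as given inputs rather than computed.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two scoring methods are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        # All SNPs scored identically: no method-specific preference.
        return {snp: 0.0 for snp in scores}
    return {snp: (s - lo) / span for snp, s in scores.items()}


def fuse_and_rank(relief: Dict[str, float],
                  mi: Dict[str, float],
                  top_k: int) -> List[str]:
    """Return the top_k SNPs by fused score.

    Assumed fusion rule (not taken from the paper): sum of the two
    min-max normalized scores. Ties are broken by SNP name so the
    ordering is deterministic.
    """
    if relief.keys() != mi.keys():
        raise ValueError("both score tables must cover the same SNPs")
    r, m = min_max(relief), min_max(mi)
    fused = {snp: r[snp] + m[snp] for snp in relief}
    ranked = sorted(fused, key=lambda snp: (-fused[snp], snp))
    return ranked[:top_k]


# Hypothetical scores for three placeholder SNPs.
relief_scores = {"snpA": 0.9, "snpB": 0.1, "snpC": 0.5}
mi_scores = {"snpA": 0.02, "snpB": 0.30, "snpC": 0.20}
candidates = fuse_and_rank(relief_scores, mi_scores, top_k=2)
```

In this toy input, `snpC` is not ranked first by either method alone but scores reasonably under both, so a fused ranking keeps it in the candidate set. That is the kind of SNP a single-score filter could discard.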
