Prediction of SNP Sequences via Gini Impurity Based Gradient Boosting Method

Machine learning approaches are increasingly applied to the analysis of single nucleotide polymorphism (SNP) data, and SNPs have been shown to be implicated in complex human diseases. When identifying the SNPs responsible for complex diseases, most genome-wide association studies consider a single SNP at a time and ignore the diverse interactions between SNPs. A major difficulty is that the number of features is large while the number of individuals is relatively small, which complicates the task and degrades the accuracy of predictions made from DNA sequences. In this paper, we propose a novel boosting-based ensemble approach to study these interactions and introduce an importance scoring strategy based on Gini impurity for feature selection. We evaluated its efficacy on the SNP genotyping data collected by the Southeastern University of China and compared it with naive Bayes, support vector machine, and random forest. The experimental results show its validity and effectiveness for SNP interaction identification. In addition, our approach showed a clear advantage in computational time and resources.
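To make the idea of a Gini impurity based importance score concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes genotypes are coded as 0, 1, or 2 (minor allele count) and labels as case/control; the function names, the thresholds, and the toy data are hypothetical. It scores each SNP by the largest Gini impurity decrease of a single split, whereas a boosted tree ensemble would typically accumulate such decreases over every split in every tree.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


def snp_score(genotypes, labels):
    """Best single-split Gini decrease for one SNP (assumed 0/1/2 coding)."""
    return max(gini_decrease(genotypes, labels, t) for t in (0, 1))


def rank_snps(snps, labels):
    """Return SNP names sorted from most to least informative."""
    scores = {name: snp_score(g, labels) for name, g in snps.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 3 cases (1) followed by 3 controls (0).
labels = [1, 1, 1, 0, 0, 0]
snps = {
    "snp_a": [2, 2, 1, 0, 0, 0],  # carriers are exactly the cases
    "snp_b": [0, 1, 2, 0, 1, 2],  # same genotype mix in cases and controls
}
ranking = rank_snps(snps, labels)
```

Here `ranking` places `snp_a` ahead of `snp_b`, so keeping only the top-ranked SNPs would act as a simple feature selection step before model fitting.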
