Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data

Background: Multifactorial diseases arise from complex patterns of interaction between a set of genetic traits and the environment. To fully capture the genetic biomarkers that jointly explain the heritability component of a disease, all SNPs from a genome-wide association study should therefore be analyzed simultaneously.

Results: In this paper, we present Bag of Naïve Bayes (BoNB), an algorithm for genetic biomarker selection and subject classification from the simultaneous analysis of genome-wide SNP data. BoNB is based on the Naïve Bayes classification framework, enriched by three main features: bootstrap aggregating of an ensemble of Naïve Bayes classifiers; a novel strategy for ranking and selecting the attributes used by each classifier in the ensemble; and a permutation-based procedure for selecting significant biomarkers, based on their marginal utility in the classification process. BoNB is tested on the Wellcome Trust Case-Control study on Type 1 Diabetes, and its performance is compared with that of both a standard Naïve Bayes algorithm and HyperLASSO, a state-of-the-art penalized logistic regression algorithm for simultaneous genome-wide data analysis.

Conclusions: The significantly higher classification accuracy obtained by BoNB, together with the significance of the biomarkers identified from the Type 1 Diabetes dataset, demonstrates the effectiveness of BoNB as an algorithm for both classification and biomarker selection from genome-wide SNP data.

Availability: Source code of the BoNB algorithm is released under the GNU General Public Licence and is available at http://www.dei.unipd.it/~sambofra/bonb.html.
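To make the first of the three features concrete, the following is a minimal sketch of bootstrap aggregating over Naïve Bayes classifiers. It is not the released BoNB code: the function names (train_nb, predict_nb, train_bag, predict_bag), the 0/1/2 genotype coding, the Laplace smoothing and the majority vote are all assumptions made for illustration. The attribute ranking strategy and the permutation-based biomarker selection are left out, since the abstract does not describe them in enough detail to reproduce.

```python
import math
import random
from collections import Counter


def train_nb(X, y, n_values=3, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing.

    X: list of genotype vectors, each SNP coded 0, 1 or 2 (assumed coding).
    y: list of class labels (e.g. 0 = control, 1 = case).
    """
    classes = sorted(set(y))
    n_features = len(X[0])
    class_counts = Counter(y)
    log_prior = {c: math.log(class_counts[c] / len(y)) for c in classes}
    log_like = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        denom = len(rows) + alpha * n_values
        per_feature = []
        for j in range(n_features):
            counts = Counter(row[j] for row in rows)
            per_feature.append(
                [math.log((counts[v] + alpha) / denom) for v in range(n_values)]
            )
        log_like[c] = per_feature
    return log_prior, log_like


def predict_nb(model, x):
    """Return the class with the highest posterior log score for sample x."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][j][v] for j, v in enumerate(x))
        for c in log_prior
    }
    return max(scores, key=scores.get)


def train_bag(X, y, n_models=25, seed=0):
    """Train one Naive Bayes model per bootstrap resample of the subjects."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        # Sample n subjects with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_nb([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_bag(models, x):
    """Aggregate the ensemble by majority vote (an assumed aggregation rule)."""
    votes = Counter(predict_nb(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: SNP 0 separates cases from controls, SNP 1 is only weakly related.
X = [[2, 0], [2, 1], [2, 2], [2, 0], [0, 0], [0, 1], [0, 2], [0, 1]] * 5
y = [1, 1, 1, 1, 0, 0, 0, 0] * 5
models = train_bag(X, y)
print(predict_bag(models, [2, 1]), predict_bag(models, [0, 0]))
```

Each classifier in the ensemble sees a different resample of the subjects, so the aggregated vote is less sensitive to any single sample than one Naïve Bayes model trained on the full dataset.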
