Data mining approaches for genome-wide association of mood disorders

Background Mood disorders are highly heritable forms of major mental illness. A major breakthrough in elucidating the genetic architecture of mood disorders was anticipated with the advent of genome-wide association studies (GWAS). However, to date few susceptibility loci have been conclusively identified. The genetic etiology of mood disorders appears to be quite complex, and as a result, alternative approaches for analyzing GWAS data are needed. Recently, a polygenic scoring approach that captures the effects of alleles across multiple loci was successfully applied to the analysis of GWAS data in schizophrenia and bipolar disorder (BP). However, this method may be overly simplistic in its approach to the complexity of genetic effects. Data mining methods are available that may be applied to analyze the high-dimensional data generated by GWAS of complex psychiatric disorders.

Results We sought to compare the performance of five data mining methods, namely Bayesian networks, support vector machines, random forests, radial basis function networks, and logistic regression, against the polygenic scoring approach in the analysis of GWAS data on BP. The different classification methods were trained on GWAS datasets from the Bipolar Genome Study (2191 cases with BP and 1434 controls), and their ability to accurately classify case/control status was tested on a GWAS dataset from the Wellcome Trust Case Control Consortium. The performance of the classifiers in the test dataset was evaluated by comparing areas under the receiver operating characteristic curves. Bayesian networks performed best among the data mining classifiers, but none of the classifiers performed significantly better than the polygenic score approach. We further examined a subset of single-nucleotide polymorphisms (SNPs) in genes that are expressed in the brain, under the hypothesis that these might be most relevant to BP susceptibility, but all the classifiers performed worse with this reduced set of SNPs.

Conclusion The discriminative accuracy of all of these methods is unlikely to be of diagnostic or clinical utility at the present time. Further research is needed to develop strategies for selecting sets of SNPs likely to be relevant to disease susceptibility and to determine whether data mining classifiers that use other algorithms for inferring relationships among the sets of SNPs may perform better.
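
The evaluation design described above (weights learned in one cohort, case/control discrimination measured by the area under the ROC curve in an independent cohort) can be illustrated with a minimal sketch. It is not the study's pipeline: the genotypes are simulated, the function names (`simulate_cohort`, `log_odds_ratios`, `polygenic_scores`, `auc`) and all parameter values are hypothetical, and the score is a simplified polygenic score that weights every SNP by its training-set log odds ratio without any p-value thresholding or pruning.

```python
import math
import random


def simulate_cohort(n_cases, n_controls, control_freqs, case_freqs, rng):
    """Simulate genotypes as allele counts (0, 1 or 2) per SNP; labels are 1=case, 0=control."""
    genotypes, labels = [], []
    for label, n, freqs in ((1, n_cases, case_freqs), (0, n_controls, control_freqs)):
        for _ in range(n):
            # Two independent allele draws per SNP give a count of 0, 1 or 2.
            genotypes.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            labels.append(label)
    return genotypes, labels


def log_odds_ratios(genotypes, labels):
    """Per-SNP allelic log odds ratio (cases vs. controls) with a 0.5 continuity correction."""
    n_case_alleles = 2 * labels.count(1)
    n_ctrl_alleles = 2 * labels.count(0)
    weights = []
    for j in range(len(genotypes[0])):
        case_alt = sum(g[j] for g, y in zip(genotypes, labels) if y == 1)
        ctrl_alt = sum(g[j] for g, y in zip(genotypes, labels) if y == 0)
        case_ref = n_case_alleles - case_alt
        ctrl_ref = n_ctrl_alleles - ctrl_alt
        odds_ratio = ((case_alt + 0.5) * (ctrl_ref + 0.5)) / ((case_ref + 0.5) * (ctrl_alt + 0.5))
        weights.append(math.log(odds_ratio))
    return weights


def polygenic_scores(genotypes, weights):
    """Score each individual as the weighted sum of allele counts across SNPs."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def auc(scores, labels):
    """Area under the ROC curve: probability that a case outscores a control (ties count 0.5)."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    ctrl_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in case_scores:
        for k in ctrl_scores:
            wins += 1.0 if c > k else 0.5 if c == k else 0.0
    return wins / (len(case_scores) * len(ctrl_scores))


rng = random.Random(0)
n_snps = 200
control_freqs = [rng.uniform(0.1, 0.5) for _ in range(n_snps)]
# Assumption for illustration: the first 20 SNPs have a slightly higher
# allele frequency in cases; the remaining 180 SNPs carry no signal.
case_freqs = [p + 0.05 if j < 20 else p for j, p in enumerate(control_freqs)]

# Independent training and test cohorts drawn from the same simulated population.
train_X, train_y = simulate_cohort(500, 500, control_freqs, case_freqs, rng)
test_X, test_y = simulate_cohort(300, 300, control_freqs, case_freqs, rng)

weights = log_odds_ratios(train_X, train_y)
test_auc = auc(polygenic_scores(test_X, weights), test_y)
print(f"Polygenic score AUC in the independent test cohort: {test_auc:.3f}")
```

Any of the data mining classifiers could be compared under the same design by replacing `polygenic_scores` with that classifier's predicted case probabilities and passing them to `auc`. An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from controls.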
