Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology

Identifying disease genes from a vast amount of genetic data is one of the most challenging tasks in the post-genomic era. Moreover, complex diseases present highly heterogeneous genotypes, which makes the identification of biological markers difficult. Machine learning methods are widely used to identify these markers, but their performance is highly dependent upon the size and quality of the available data. In this study, we demonstrated that machine learning classifiers trained on gene functional similarities, computed using Gene Ontology (GO), can improve the identification of genes involved in complex diseases. For this purpose, we developed a supervised machine learning methodology to predict complex disease genes. The proposed pipeline was assessed using Autism Spectrum Disorder (ASD) candidate genes. A quantitative measure of gene functional similarities was obtained by employing different semantic similarity measures. To infer the hidden functional similarities between ASD genes, various types of machine learning classifiers were built on quantitative semantic similarity matrices of ASD and non-ASD genes. The classifiers trained and tested on ASD and non-ASD gene functional similarities outperformed previously reported ASD classifiers. For example, a Random Forest (RF) classifier achieved an AUC of 0.80 for predicting new ASD genes, which was higher than that of the previously reported classifier (0.73). Additionally, this classifier was able to predict 73 novel ASD candidate genes that were enriched for core ASD phenotypes, such as autism and obsessive-compulsive behavior. The predicted genes were also enriched for ASD co-occurring conditions, including Attention Deficit Hyperactivity Disorder (ADHD). We also developed a KNIME workflow implementing the proposed methodology, which allows users to configure and execute it without requiring machine learning or programming skills.
Machine learning is an effective and reliable technique for deciphering ASD mechanisms by identifying novel disease genes, and this study further demonstrated that its performance can be improved by incorporating a quantitative measure of gene functional similarities. Source code and the workflow of the proposed methodology are available at https://github.com/Muh-Asif/ASD-genes-prediction.
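
To make the notion of a quantitative semantic similarity matrix concrete, the following is a minimal sketch and not the implementation used in this study. It assumes a Resnik-style, information-content-based term similarity combined into a gene-level score by a best-match average; the abstract does not state which measures or combination strategy were used. The toy ontology, the gene names, and all function names are hypothetical.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a DAG rooted at "root").
PARENTS = {
    "root": set(),
    "signaling": {"root"},
    "metabolism": {"root"},
    "synaptic_signaling": {"signaling"},
    "chromatin_remodeling": {"metabolism"},
}

# Hypothetical annotations: gene -> directly annotated terms.
ANNOTATIONS = {
    "geneA": {"synaptic_signaling"},
    "geneB": {"synaptic_signaling", "signaling"},
    "geneC": {"chromatin_remodeling"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t), with n_t the number of genes annotated to t or a descendant."""
    n_genes = len(ANNOTATIONS)
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set().union(*(ancestors(t) for t in terms))
        for term in covered:
            counts[term] += 1
    return {t: math.log(n_genes / c) for t, c in counts.items() if c > 0}


IC = information_content()


def resnik(t1, t2):
    """Resnik-style similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC[t] for t in common)


def gene_similarity(g1, g2):
    """Best-match average of term-level similarities (assumed combination strategy)."""
    terms1, terms2 = ANNOTATIONS[g1], ANNOTATIONS[g2]
    best1 = [max(resnik(a, b) for b in terms2) for a in terms1]
    best2 = [max(resnik(a, b) for a in terms1) for b in terms2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))


# Gene-by-gene similarity matrix; each row could serve as a feature vector.
GENES = sorted(ANNOTATIONS)
MATRIX = [[gene_similarity(a, b) for b in GENES] for a in GENES]

for gene, row in zip(GENES, MATRIX):
    print(gene, [round(value, 3) for value in row])
```

In a pipeline of the kind described above, each row of such a matrix would plausibly be used as the feature vector of one gene, labeled as ASD or non-ASD, before training a classifier such as a Random Forest.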
