Exploiting Negative Sample Selection for Prioritizing Candidate Disease Genes

A major challenge in biomedicine is finding the genetic causes of human diseases, and researchers are often faced with a large number of candidate genes. Gene prioritization methods provide valuable support by guiding researchers toward reliable candidate causative genes for the disease under study: they rank genes according to their association with the disease of interest. However, most genetic disorders have few or no known causative genes associated with them. This induces a severe label imbalance in the corresponding ranking problems, so reliable solutions require imbalance-aware techniques. We propose an expressly designed imbalance-aware methodology for prioritizing genes, which first rebalances the training set through a negative selection procedure and then applies a learning algorithm that is 'sensitive' to the misclassification of positive instances to produce the gene ranking. The algorithm has a reduced time complexity, which makes its application to large datasets feasible. The validation of this methodology showed that it is competitive with state-of-the-art techniques on a benchmark composed of 708 selected Medical Subject Headings diseases, and it provided some putative novel gene-disease associations.
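The two-stage scheme described above (negative selection, then a learner that penalizes errors on positives more heavily) can be illustrated with a minimal sketch. This is not the authors' algorithm: the selection heuristic (keeping the negatives closest to the centroid of the positives), the weighted logistic regression, and all names and parameters (`select_negatives`, `fit_weighted_logistic`, `rank_genes`, `n_keep`, `pos_weight`) are assumptions introduced purely for illustration.

```python
import numpy as np

def select_negatives(X, y, n_keep):
    """Return training indices: all positives plus n_keep selected negatives.

    Assumed heuristic (not from the paper): keep the negatives that lie
    closest to the centroid of the positive examples.
    """
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    centroid = X[pos].mean(axis=0)
    dist = np.linalg.norm(X[neg] - centroid, axis=1)
    kept = neg[np.argsort(dist)[:n_keep]]
    return np.concatenate([pos, kept])

def fit_weighted_logistic(X, y, pos_weight, lr=0.1, epochs=500):
    """Logistic regression trained by gradient descent, where each positive
    example contributes pos_weight times as much to the loss as a negative."""
    w = np.zeros(X.shape[1])
    b = 0.0
    sample_w = np.where(y == 1, pos_weight, 1.0)
    total = sample_w.sum()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = sample_w * (p - y)          # weighted residuals
        w -= lr * (X.T @ grad) / total
        b -= lr * grad.sum() / total
    return w, b

def rank_genes(X, w, b):
    """Score every gene and return indices ordered from most to least likely."""
    scores = X @ w + b
    return np.argsort(-scores), scores

# Toy data: 200 hypothetical genes with 4 features, only 5 labeled positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.zeros(200, dtype=int)
y[:5] = 1
X[:5] += 3.0                               # positives are shifted in feature space

train_idx = select_negatives(X, y, n_keep=20)
w, b = fit_weighted_logistic(X[train_idx], y[train_idx], pos_weight=4.0)
ranking, scores = rank_genes(X, w, b)
```

In this sketch `pos_weight` is set to the ratio of kept negatives to positives, so both classes carry equal total weight after selection; other choices are equally plausible, and the actual method may balance the two stages differently.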
