Integration of multiple data sources to prioritize candidate genes using discounted rating system

BackgroundIdentifying disease gene from a list of candidate genes is an important task in bioinformatics. The main strategy is to prioritize candidate genes based on their similarity to known disease genes. Most of existing gene prioritization methods access only one genomic data source, which is noisy and incomplete. Thus, there is a need for the integration of multiple data sources containing different information.ResultsIn this paper, we proposed a combination strategy, called discounted rating system (DRS). We performed leave one out cross validation to compare it with N-dimensional order statistics (NDOS) used in Endeavour. Results showed that the AUC (Area Under the Curve) values achieved by DRS were comparable with NDOS on most of the disease families. But DRS worked much faster than NDOS, especially when the number of data sources increases. When there are 100 candidate genes and 20 data sources, DRS works more than 180 times faster than NDOS. In the framework of DRS, we give different weights for different data sources. The weighted DRS achieved significantly higher AUC values than NDOS.ConclusionThe proposed DRS algorithm is a powerful and effective framework for candidate gene prioritization. If weights of different data sources are proper given, the DRS algorithm will perform better.

[1]  Hanno Steen,et al.  Development of human protein reference database as an initial platform for approaching systems biology in humans. , 2003, Genome research.

[2]  Alan F. Scott,et al.  Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders , 2002, Nucleic Acids Res..

[3]  Jaana Kekäläinen,et al.  IR evaluation methods for retrieving highly relevant documents , 2000, SIGIR '00.

[4]  Frances S. Turner,et al.  POCUS: mining genomic sequence annotation to predict disease genes , 2003, Genome Biology.

[5]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[6]  Jagdish Chandra Patra,et al.  A New Method to Combine Heterogeneous Data Sources for Candidate Gene Prioritization , 2009, 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering.

[7]  J. Nadeau,et al.  Finding Genes That Underlie Complex Traits , 2002, Science.

[8]  D. Botstein,et al.  Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease , 2003, Nature Genetics.

[9]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[10]  Jan Freudenberg,et al.  A similarity-based method for genome-wide prediction of disease-relevant human genes , 2002, ECCB.

[11]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[12]  Joshua M. Stuart,et al.  A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules , 2003, Science.

[13]  Tijl De Bie,et al.  Kernel-based data fusion for gene prioritization , 2007, ISMB/ECCB.

[14]  P. Bork,et al.  Association of genes to genetically inherited diseases using data mining , 2002, Nature Genetics.

[15]  Bart De Moor,et al.  Endeavour update: a web resource for gene prioritization in multiple species , 2008, Nucleic Acids Res..

[16]  Gene Ontology Consortium The Gene Ontology (GO) database and informatics resource , 2003 .

[17]  Bassem A. Hassan,et al.  Gene prioritization through genomic data fusion , 2006, Nature Biotechnology.

[18]  C. Ouzounis,et al.  Genome-wide identification of genes likely to be involved in human genetic disease. , 2004, Nucleic acids research.

[19]  L. Asz Random Walks on Graphs: a Survey , 2022 .

[20]  P. Robinson,et al.  Walking the interactome for prioritization of candidate disease genes. , 2008, American journal of human genetics.

[21]  Paul Pavlidis,et al.  Gene Ontology term overlap as a measure of gene functional similarity , 2008, BMC Bioinformatics.

[22]  Mike Tyers,et al.  BioGRID: a general repository for interaction datasets , 2005, Nucleic Acids Res..

[23]  David J. Porteous,et al.  SUSPECTS : enabling fast and effective prioritization of positional candidates , 2005 .

[24]  Yongjin Li,et al.  Discovering disease-genes by topological features in human protein-protein interaction network , 2006, Bioinform..