CDGMiner: A New Tool for the Identification of Disease Genes by Text Mining and Functional Similarity Analysis

In the post-genomic era, identifying the genes involved in human disease is one of the most important tasks. Disease phenotypes provide a window into gene function. Several approaches that identify disease-related genes from function annotations have been presented in recent years. Most of them, however, start from the function annotations of genes already known to be associated with a disease, and therefore cannot be used for diseases that have no known pathogenic genes or related function annotations. We have built a new system, CDGMiner, to predict genes associated with diseases that lack such detailed function annotations. CDGMiner works in two main phases: text mining and functional similarity analysis. Its performance was tested on a set of 1506 genes involved in 1147 disease phenotypes derived from the OMIM database. On average, the target gene ranked within the top 13.60%, and in 40.70% of cases it ranked within the top 5%. CDGMiner shows promising performance compared with existing tools.
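
The two reported figures (average rank position and the share of cases in the top 5%) can be illustrated with a minimal sketch. It assumes each test case yields a best-first ranked list of candidate genes and that a gene's position is expressed as a percentage of the list length; the abstract does not say how CDGMiner computes its rankings, so the function names and toy data below are hypothetical.

```python
from statistics import mean

def rank_percentile(ranked_genes, target):
    """Position of target in a best-first ranked list, as a percentage of list length."""
    position = ranked_genes.index(target) + 1  # 1-based rank
    return 100.0 * position / len(ranked_genes)

def summarize(percentiles, cutoff=5.0):
    """Return (mean percentile, percentage of cases at or below the cutoff)."""
    within = sum(1 for p in percentiles if p <= cutoff)
    return mean(percentiles), 100.0 * within / len(percentiles)

# Hypothetical toy data: three test cases sharing one ranked list of 20 candidates.
candidates = [f"G{i}" for i in range(1, 21)]
cases = [(candidates, "G1"), (candidates, "G4"), (candidates, "G10")]

percentiles = [rank_percentile(ranked, target) for ranked, target in cases]
avg, top5 = summarize(percentiles)
print(f"average position: top {avg:.2f}%, in top 5%: {top5:.2f}% of cases")
```

Under this reading, a lower average percentile and a higher top-5% share both indicate better prioritization.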
