Association of genes to genetically inherited diseases using data mining

Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in the relevant databases (LocusLink, OMIM) are already linked to a region of the human genome, about 450 have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes, some of which are not even annotated (see, for example, refs 3,4). The public availability of the human genome draft sequence has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Building on recent progress in the systematic annotation of genes using controlled vocabularies, we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene had a 25% chance of being among the 8 best-scoring genes and a 50% chance of being among the best 30, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that for some diseases the chance of identifying the underlying gene is higher than for others.
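The benchmark figures above are top-k statistics: for each known disease gene, the candidates in its chromosomal region are ranked by score, and one counts how often the known gene falls within the best k. The abstract does not say how the scores or ranks were computed, so the following is only a minimal sketch of that counting step. The gene names, scores, and the helper functions `rank_of_gene` and `top_k_fraction` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def rank_of_gene(scores: Dict[str, float], gene: str) -> int:
    """Return the 1-based rank of `gene` among candidates sorted by descending score.

    Ties keep dictionary insertion order, because sorted() is stable.
    """
    ordered = sorted(scores, key=lambda g: scores[g], reverse=True)
    return ordered.index(gene) + 1


def top_k_fraction(ranks: List[int], k: int) -> float:
    """Fraction of benchmark cases whose known gene ranks within the best k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy benchmark: (candidate scores in the mapped region, known disease gene).
benchmark = [
    ({"geneA": 0.9, "geneB": 0.4, "geneC": 0.1}, "geneA"),
    ({"geneD": 0.2, "geneE": 0.8, "geneF": 0.5}, "geneF"),
    ({"geneG": 0.3, "geneH": 0.7, "geneI": 0.6, "geneJ": 0.1}, "geneJ"),
    ({"geneK": 0.5, "geneL": 0.6}, "geneL"),
]

ranks = [rank_of_gene(scores, gene) for scores, gene in benchmark]

for k in (1, 2, 4):
    print(f"known gene within top {k}: {top_k_fraction(ranks, k):.2f}")
```

In the paper's setting the benchmark would hold 100 cases and k would be 8 and 30; the toy values here only illustrate the calculation.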

[1] Hans-Jürgen Zimmermann, et al. Fuzzy Set Theory - and Its Applications, 1985.

[2] P. Shashidharan, et al. Glutamate Dehydrogenase Deficiency in Cerebellar Degenerations: Clinical, Biochemical and Molecular Genetic Aspects, 1993, Canadian Journal of Neurological Sciences / Journal Canadien des Sciences Neurologiques.

[3] H. Zimmermann, et al. Fuzzy Set Theory and Its Applications, 1993.

[4] H.-J. Zimmermann, et al. Fuzzy set theory—and its applications (3rd ed.), 1996.

[5] Thomas L. Madden, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, 1997, Nucleic Acids Research.

[6] E. Lander, et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes, 1999.

[7] D. Valle, et al. Online Mendelian Inheritance In Man (OMIM), 2000, Human Mutation.

[8] M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000, Nature Genetics.

[9] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome, 2001, Nature.

[10] S. Batalov, et al. A Comparison of the Celera and Ensembl Predicted Gene Sets Reveals Little Overlap in Novel Genes, 2001, Cell.

[11] Shawn K. Westaway, et al. A novel pantothenate kinase gene (PANK2) is defective in Hallervorden-Spatz syndrome, 2001, Nature Genetics.

[12] Jonathan C. Cohen, et al. Autosomal Recessive Hypercholesterolemia Caused by Mutations in a Putative LDL Receptor Adaptor Protein, 2001, Science.

[13] Donna R. Maglott, et al. RefSeq and LocusLink: NCBI gene-centered resources, 2001, Nucleic Acids Res.

[14] J. V. Moran, et al. Initial sequencing and analysis of the human genome, 2001, Nature.