Background
Biological data that are well organized by an ontology, such as Gene Ontology, enable high-throughput use on the semantic web. Such data can also facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on the automated classification of human disease genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group and contains 923 human disease genes organized in 14 categories of protein function.

Results
Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, the automated methods achieve overall precision of 56% and 47%, with recall of 62% and 71%, respectively. However, approximately 15% of the studied human disease genes remain without GO annotations.

Conclusion
Automated methods can recapitulate a significant portion of the classification of human disease genes. The method using information-theoretic distance performs slightly better on precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%.
In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations.
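To make the evaluation idea concrete, the following is a minimal sketch of the first kind of method described above: genes are assigned to protein function categories through pre-selected GO terms, and the assignments are scored against a gold standard. Everything in it is an illustrative assumption rather than the study's implementation: the three-entry `TERM_TO_CATEGORY` table stands in for the 74 pre-selected terms, the gene names and toy annotations are hypothetical, annotations are assumed to be already propagated to ancestor terms, and precision and recall are micro-averaged over (gene, category) pairs, which may differ from the scoring used in the study.

```python
# Illustrative mapping from GO terms to protein function categories.
# This is NOT the study's list of 74 pre-selected terms.
TERM_TO_CATEGORY = {
    "GO:0005179": "hormone",               # hormone activity
    "GO:0003700": "transcription factor",  # DNA-binding transcription factor activity
    "GO:0003824": "enzyme",                # catalytic activity
}


def classify(annotations, term_to_category):
    """Assign each gene the set of categories reached by its GO terms.

    Assumes each gene's annotation set already includes ancestor terms,
    so a plain lookup is enough (no ontology traversal here).
    """
    return {
        gene: {term_to_category[t] for t in terms if t in term_to_category}
        for gene, terms in annotations.items()
    }


def precision_recall(predicted, gold):
    """Micro-averaged precision and recall over (gene, category) pairs."""
    pred_pairs = {(g, c) for g, cats in predicted.items() for c in cats}
    gold_pairs = {(g, c) for g, cats in gold.items() for c in cats}
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical toy data: GENE_C has no GO annotation, so it cannot be classified.
annotations = {
    "GENE_A": {"GO:0005179"},
    "GENE_B": {"GO:0003700", "GO:0003824"},
    "GENE_C": set(),
}
gold = {
    "GENE_A": {"hormone"},
    "GENE_B": {"transcription factor"},
    "GENE_C": {"enzyme"},
}

predicted = classify(annotations, TERM_TO_CATEGORY)
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the unannotated gene lowers recall, while the extra category assigned to `GENE_B` lowers precision. This mirrors, at a very small scale, how missing GO annotations and imperfect use of existing annotations can each limit automated classification.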