Incorporating Biological Domain Knowledge into Cluster Validity Assessment

This paper presents an approach to assessing cluster validity based on similarity knowledge extracted from the Gene Ontology (GO) and from databases annotated to the GO. A knowledge-driven cluster validity assessment system for microarray data was implemented, and different methods were applied to measure GO-based similarity between yeast gene products. Two methods are proposed for calculating cluster validity indices from GO-driven similarity. The first processes overall similarity values, calculated from the combined annotations originating from the three GO hierarchies. The second is based on GO hierarchy-independent similarity values, obtained from each of these hierarchies. A traditional node-counting method and an information content technique were implemented to measure knowledge-based similarity (biological distance) between gene products. The results contribute to the evaluation of clustering outcomes and the identification of optimal cluster partitions, and the approach may be an effective tool to support biomedical knowledge discovery in gene expression data analysis.
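To make the two families of similarity measures concrete, the following is a minimal sketch, not the system described above. It assumes a Wu-Palmer-style formula for the node-counting measure and a Resnik-style formula for the information content measure; the abstract does not state which exact formulas were used. The toy ontology, the annotation counts, and all function names are hypothetical, and the ontology is simplified to a tree, whereas the real GO is a directed acyclic graph in which a term can have several parents.

```python
import math

# Hypothetical toy ontology (child -> parent); these are not real GO terms.
PARENT = {
    "root": None,
    "metabolism": "root",
    "transport": "root",
    "lipid_metabolism": "metabolism",
    "sugar_metabolism": "metabolism",
}

# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "metabolism": 2,
    "transport": 4,
    "lipid_metabolism": 1,
    "sugar_metabolism": 1,
}


def ancestors(term):
    """Return the path from term up to the root, with term itself first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(ancestors(term))


def lowest_common_ancestor(a, b):
    """Deepest term shared by the root paths of a and b."""
    on_path_b = set(ancestors(b))
    for term in ancestors(a):  # ordered from deepest to shallowest
        if term in on_path_b:
            return term
    raise ValueError("terms do not share a root")


def node_counting_similarity(a, b):
    """Wu-Palmer-style similarity: 2 * depth(lca) / (depth(a) + depth(b))."""
    lca = lowest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))


def term_probability(term):
    """Fraction of annotations that fall on term or any of its descendants."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(n for t, n in DIRECT_COUNTS.items() if term in ancestors(t))
    return covered / total


def information_content_similarity(a, b):
    """Resnik-style similarity: highest -log p(c) over common ancestors c."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(-math.log(term_probability(t)) for t in common)


def gene_similarity(terms_a, terms_b, term_similarity):
    """Average pairwise term similarity between two gene products' annotations."""
    pairs = [(x, y) for x in terms_a for y in terms_b]
    return sum(term_similarity(x, y) for x, y in pairs) / len(pairs)
```

Averaging over all annotation pairs is only one possible way to lift term-level similarity to gene products; taking the maximum over pairs is another common choice. Under the hierarchy-independent approach described above, a function such as `gene_similarity` would be applied separately to the annotations from each GO hierarchy rather than to the combined set.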
