Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures

Clustering genes on the basis of their expression profiles is generally the first step in understanding how a class of genes behaves in a biological process. Many supervised and unsupervised algorithms from the statistics and machine learning literature can be applied to microarray data, but they offer limited means of judging whether the resulting clusters are biologically meaningful. If two gene sequences are similar, we would expect their expression levels to be similar and the genes to be similarly annotated in the Gene Ontology (GO) databases. Comparing the expression-level similarity of two genes against the similarity of their GO annotations is therefore a way to test this expectation. Semantic similarity has become a valuable tool for validating the results of biomedical studies such as gene clustering and gene expression data analysis. This paper builds on our previous work on meta-ensembles for cancer datasets, in which the output of several clustering algorithms is fed to a consensus-building process to generate a stable set of clusters. These clusters are then refined through a sequence of biological validation steps that assess each gene pair in a given cluster using semantic similarity and sequence similarity. We have tested our approach on several benchmark cancer datasets in an attempt to provide a more accurate biological analysis of the clusters, and the results have been found to be satisfactory.
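The two-stage pipeline described above can be illustrated with a minimal Python sketch. The abstract does not say how the consensus is built or how gene pairs are scored, so everything here is an assumption made for illustration: a co-association (majority-vote) consensus stands in for the meta-ensemble step, the pairwise semantic and sequence similarity scores are taken as precomputed values in [0, 1], and the function names and thresholds (`consensus_clusters`, `validated_pairs`, `min_agreement`, `sem_min`, `seq_min`) are hypothetical.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of clusterings in which each gene pair shares a cluster."""
    n = len(labelings[0])
    return {
        (i, j): sum(labels[i] == labels[j] for labels in labelings) / len(labelings)
        for i, j in combinations(range(n), 2)
    }


def consensus_clusters(labelings, min_agreement=0.5):
    """Assumed consensus rule: link genes that co-cluster in more than
    min_agreement of the runs, then return the connected components."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), frac in co_association(labelings).items():
        if frac > min_agreement:
            parent[find(i)] = find(j)

    groups = {}
    for gene in range(n):
        groups.setdefault(find(gene), []).append(gene)
    return sorted(groups.values())


def validated_pairs(cluster, sem_sim, seq_sim, sem_min=0.5, seq_min=0.5):
    """Keep the gene pairs of one cluster whose semantic AND sequence
    similarity both reach the (hypothetical) thresholds."""
    return [
        (a, b)
        for a, b in combinations(sorted(cluster), 2)
        if sem_sim.get((a, b), 0.0) >= sem_min and seq_sim.get((a, b), 0.0) >= seq_min
    ]


# Toy data: three clusterings of five genes (indices 0..4).
runs = [[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [1, 1, 0, 0, 0]]
clusters = consensus_clusters(runs)
print(clusters)  # [[0, 1], [2, 3, 4]]

# Made-up pairwise scores for the second consensus cluster.
sem = {(2, 3): 0.8, (2, 4): 0.3, (3, 4): 0.9}
seq = {(2, 3): 0.7, (2, 4): 0.9, (3, 4): 0.2}
print(validated_pairs(clusters[1], sem, seq))  # [(2, 3)]
```

In practice the semantic scores would come from a GO-based measure and the sequence scores from an alignment tool. The sketch shows only how such scores could be combined per gene pair once a consensus clustering is available.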
