Kernel hierarchical gene clustering from microarray expression data

MOTIVATION: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to the expression profiles of genes across multiple experiments to identify groups of genes with similar profiles. Previous work applying the support vector machine, a supervised learning algorithm, to microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes.

RESULTS: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation by itself is insufficient to improve clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is beneficial only when combined with a learning algorithm, such as the support vector machine, that does not suffer from the curse of dimensionality.

AVAILABILITY: Supplementary data are available at www.cs.columbia.edu/compbio/hiclust. Software source code is available on request.
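
As an illustration of the kernel-based generalization described in the results, the sketch below clusters expression profiles using distances computed in a kernel-induced feature space. It relies on the standard identity d(x, y)^2 = K(x, x) - 2 K(x, y) + K(y, y), so the feature space is never built explicitly. The abstract does not specify the kernel, the linkage rule, or the implementation, so everything here is an assumption: the polynomial kernel of degree 2, average linkage via SciPy, the function names (`polynomial_kernel`, `kernel_distances`, `kernel_hierarchical_clusters`), and the synthetic toy data. It should not be read as the authors' software.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def polynomial_kernel(X, degree=2):
    """Assumed kernel: K(x, y) = (x . y + 1)^degree for all pairs of rows of X."""
    return (X @ X.T + 1.0) ** degree


def kernel_distances(K):
    """Feature-space Euclidean distances derived from a kernel matrix K.

    Uses d(x, y)^2 = K(x, x) - 2 K(x, y) + K(y, y).
    """
    diag = np.diag(K)
    squared = diag[:, None] - 2.0 * K + diag[None, :]
    np.fill_diagonal(squared, 0.0)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(squared, 0.0))


def kernel_hierarchical_clusters(X, n_clusters, degree=2):
    """Average-linkage clustering on kernel-induced distances (illustrative only)."""
    D = kernel_distances(polynomial_kernel(X, degree))
    # linkage expects a condensed distance vector, not a square matrix.
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Synthetic toy data (not from the paper): 10 "genes" measured in 8 "experiments",
# five with a high expression level and five near zero.
rng = np.random.default_rng(0)
profiles = np.vstack([
    rng.normal(2.0, 0.1, size=(5, 8)),
    rng.normal(0.0, 0.1, size=(5, 8)),
])
labels = kernel_hierarchical_clusters(profiles, n_clusters=2)
print(labels)
```

Because only the pairwise distance matrix changes, any standard agglomerative procedure can be reused unmodified. With a linear kernel (K = X Xᵀ) the distances reduce to ordinary Euclidean distances, which gives a simple sanity check for the helper.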
