Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering

It is widely accepted in microarray analysis that identifying local patterns, characterized by coherent groups of genes and conditions, may shed light on previously undetectable cellular processes of genes as well as on macroscopic phenotypes of related samples. To cluster genes and conditions simultaneously, we previously developed a fast coclustering algorithm, minimum sum-squared residue coclustering (MSSRCC), which employs an alternating minimization scheme and generates what we call coclusters in a "checkerboard" structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and to resolve the degeneracy problem in partitional clustering algorithms. The strategies are binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of these strategies on both synthetic gene expression data sets and real human cancer microarrays, and we provide empirical evidence that MSSRCC with the proposed strategies performs better than existing coclustering and clustering algorithms. In particular, the combination of all three strategies leads to the best performance. Furthermore, we illustrate the coherence of the resulting coclusters in a checkerboard structure, where the genes in a cocluster manifest the phenotype structure of the corresponding samples, and we evaluate the enrichment of functional annotations in the Gene Ontology (GO).
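
As a rough illustration of the kind of objective a minimum sum-squared residue method works with, the sketch below scores a checkerboard partition of a small matrix. It is not the authors' implementation: the abstract does not define the residue, so the sketch assumes the simplest choice, each entry minus the mean of its cocluster, and the function name `sum_squared_residue` and the toy data are hypothetical.

```python
def sum_squared_residue(matrix, row_labels, col_labels):
    """Score a checkerboard coclustering of `matrix`.

    Assumption (not stated in the abstract): the residue of an entry is
    the entry minus the mean of the cocluster (row cluster x column
    cluster) that contains it.
    """
    # Collect the entries of every (row cluster, column cluster) block.
    blocks = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            blocks.setdefault(key, []).append(float(value))

    # Sum the squared deviations from each block mean.
    total = 0.0
    for values in blocks.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


# Toy "expression" matrix with an exact 2 x 2 checkerboard structure.
A = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [9, 9, 3, 3],
    [9, 9, 3, 3],
]

# Row and column labels that match the planted structure.
good = sum_squared_residue(A, [0, 0, 1, 1], [0, 0, 1, 1])

# Row labels that mix the two row groups.
bad = sum_squared_residue(A, [0, 1, 0, 1], [0, 0, 1, 1])

print(good, bad)
```

Under this assumed residue, the partition that matches the planted blocks scores zero, while a partition that mixes row groups scores higher. An alternating minimization scheme would reassign row labels and column labels in turn so as to lower this score.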
