Finding large average submatrices in high dimensional data

The search for sample-variable associations is an important problem in the exploratory analysis of high dimensional data. Biclustering methods search for sample-variable associations in the form of distinguished submatrices of the data matrix. (The rows and columns of a submatrix need not be contiguous.) In this paper we propose and evaluate a statistically motivated biclustering procedure (LAS) that finds large average submatrices within a given real-valued data matrix. The procedure operates in an iterative-residual fashion, and is driven by a Bonferroni-based significance score that effectively trades off between submatrix size and average value. We examine the performance and potential utility of LAS, and compare it with a number of existing methods, through an extensive three-part validation study using two gene expression datasets. The validation study examines quantitative properties of biclusters, biological and clinical assessments using auxiliary information, and classification of disease subtypes using bicluster membership. In addition, we carry out a simulation study to assess the effectiveness and noise sensitivity of the LAS search procedure. These results suggest that LAS is an effective exploratory tool for the discovery of biologically relevant structures in high dimensional data. Software is available at this https URL
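To make the trade-off between submatrix size and average value concrete, the following is a minimal sketch of one plausible form of a Bonferroni-based significance score. It is an illustration under stated assumptions, not the procedure's actual implementation: it assumes the entries of the m x n data matrix are independent standard normal under the null, so that the average of a k x l submatrix is normal with variance 1/(kl), and it applies a Bonferroni correction over the number of k x l submatrices. The function names (`log_binom`, `log_normal_upper_tail`, `bonferroni_score`) are hypothetical.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_normal_upper_tail(z):
    """log P(Z >= z) for a standard normal Z."""
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    if p > 0.0:
        return math.log(p)
    # erfc underflows for very large z; fall back on the standard
    # asymptotic approximation P(Z >= z) ~ exp(-z^2 / 2) / (z * sqrt(2*pi)).
    return -0.5 * z * z - math.log(z) - 0.5 * math.log(2.0 * math.pi)


def bonferroni_score(m, n, k, l, avg):
    """Assumed score: -log of the Bonferroni-corrected tail probability.

    m, n : dimensions of the full data matrix
    k, l : dimensions of the candidate submatrix
    avg  : average of the candidate submatrix entries
    """
    # Under the assumed null, avg * sqrt(k * l) is standard normal.
    z = avg * math.sqrt(k * l)
    # Bonferroni correction: multiply the tail probability by the
    # number of k x l submatrices, C(m, k) * C(n, l), working in logs.
    log_corrected_p = log_binom(m, k) + log_binom(n, l) + log_normal_upper_tail(z)
    return -log_corrected_p


# A larger submatrix with the same average scores higher, as does a
# higher average at the same size.
small = bonferroni_score(100, 50, 10, 10, 1.0)
large = bonferroni_score(100, 50, 20, 20, 1.0)
weak = bonferroni_score(100, 50, 10, 10, 0.5)
```

Under these assumptions, the tail term rewards both a high average and a large number of entries, while the binomial terms penalize the number of candidate submatrices of that size, which is what lets a single score compare submatrices of different dimensions.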
