Algorithm for Discovering Low-Variance 3-Clusters from Real-Valued Datasets

Triclusters have recently been investigated in the context of two relational datasets that share labels along one of their dimensions. Processing the two datasets simultaneously to unveil triclusters can yield new, useful knowledge and insights. However, some recently reported methods are either closely tied to specific problems or require the datasets to follow specific distributions. Many applications need algorithms that generate triclusters whose cell values exhibit simple, well-known statistical properties, such as an upper bound on the standard deviation. In this paper we present a 3-Clustering algorithm that searches for meaningful combinations of biclusters in two related datasets. The algorithm can handle situations in which: (i) a few data objects are present in only one of the two datasets, (ii) the two datasets have different numbers of objects and/or attributes, and (iii) the cell-value distributions of the two datasets differ. In our formulation, each selected tricluster is formed by two independent biclusters such that the standard deviation of the cell values in each bicluster obeys an upper bound and the sets of objects in the two biclusters overlap to the maximum possible extent. We validate the algorithm by presenting the properties of the 3-Clusters discovered from a synthetic dataset and from a real-world cross-species genomic dataset. The results unveil interesting insights for the cross-species genomic domain.
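The formulation above can be illustrated with a minimal sketch. It is not the paper's algorithm: it only checks whether a given pair of biclusters satisfies the standard-deviation bounds and counts how many object labels the two biclusters share. The data layout (nested dictionaries keyed by object and attribute labels), the function names, the toy values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import pstdev


def bicluster_std(data, objects, attributes):
    """Population standard deviation of the cells selected by a bicluster.

    `data` is assumed to map object label -> {attribute label -> value}.
    """
    cells = [data[o][a] for o in objects for a in attributes]
    return pstdev(cells)


def satisfies_bounds(data1, bic1, max_std1, data2, bic2, max_std2):
    """True if each bicluster's cell values obey its own upper bound."""
    objs1, attrs1 = bic1
    objs2, attrs2 = bic2
    return (bicluster_std(data1, objs1, attrs1) <= max_std1
            and bicluster_std(data2, objs2, attrs2) <= max_std2)


def object_overlap(bic1, bic2):
    """Number of object labels shared by the two biclusters."""
    return len(set(bic1[0]) & set(bic2[0]))


# Toy data (hypothetical): g3 occurs only in dataset 1, g4 only in dataset 2,
# and the two datasets have different attributes and value ranges.
dataset1 = {
    "g1": {"a": 1.0, "b": 1.2},
    "g2": {"a": 1.1, "b": 0.9},
    "g3": {"a": 5.0, "b": 9.0},
}
dataset2 = {
    "g1": {"x": 10.0, "y": 10.5},
    "g2": {"x": 10.2, "y": 9.8},
    "g4": {"x": 0.0, "y": 3.0},
}

tight1 = (["g1", "g2"], ["a", "b"])
tight2 = (["g1", "g2"], ["x", "y"])
loose1 = (["g1", "g2", "g3"], ["a", "b"])  # g3 inflates the variance

accepted = satisfies_bounds(dataset1, tight1, 0.5, dataset2, tight2, 0.5)
rejected = satisfies_bounds(dataset1, loose1, 0.5, dataset2, tight2, 0.5)
shared = object_overlap(tight1, tight2)
```

A search procedure in this spirit would enumerate candidate biclusters in each dataset, keep the pairs that satisfy the bounds, and prefer those with the largest object overlap; how the paper actually performs that search is not stated in the abstract.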
