Finding Time-Lagged 3D Clusters

Existing 3D clustering algorithms for gene × sample × time expression data do not consider the time lags between correlated gene expression patterns. In addition, they either ignore correlation on time subseries, disregard the continuity of the time series, or validate only pure shifting or pure scaling coherent patterns rather than general shifting-and-scaling patterns. In this paper, we propose a novel 3D cluster model, the S2D3 Cluster, to address these problems, where S2 denotes the shifting-and-scaling correlation and D3 the 3-dimensional gene × sample × time data. Within the S2D3 Cluster model, the expression levels of genes are shifting-and-scaling coherent in both a sample subspace and a time subseries, with arbitrary time lags. We develop a 3D clustering algorithm, LagMiner, for identifying interesting S2D3 Clusters that satisfy the constraints of regulation (γ), coherence (ε), minimum gene number (MinG), minimum sample subspace size (MinS) and minimum time period length (MinT). Experimental results on both synthetic and real-life datasets show that LagMiner is effective, scalable and parameter-robust. Although we use gene expression data in this paper, our model and algorithm can be applied to any other data where both spatial and temporal coherence are sought.
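To make the notion of time-lagged shifting-and-scaling coherence concrete, the following is a minimal illustrative sketch, not the LagMiner algorithm. It assumes two expression profiles measured at the same time points, tries each non-negative lag, and fits y[t + lag] ≈ scale · x[t] + shift by ordinary least squares. The function names, the `max_lag` and `min_len` parameters (the latter loosely playing the role of MinT), and the toy data are all hypothetical.

```python
def fit_shift_scale(x, y):
    """Least-squares fit of y ~ scale * x + shift.

    Returns (scale, shift, max_abs_residual), or None if x is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    if var_x == 0:
        return None  # a constant profile carries no pattern to match
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    resid = max(abs(b - (scale * a + shift)) for a, b in zip(x, y))
    return scale, shift, resid


def best_lag(x, y, max_lag, min_len):
    """Try lags 0..max_lag (y lagging behind x) and keep the best fit.

    Returns (lag, scale, shift, max_abs_residual) for the lag with the
    smallest residual, or None if no lag leaves at least min_len points.
    """
    best = None
    for lag in range(max_lag + 1):
        xs = x[: len(x) - lag]  # x at times t
        ys = y[lag:]            # y at times t + lag
        if len(xs) < min_len:
            continue
        fit = fit_shift_scale(xs, ys)
        if fit is None:
            continue
        scale, shift, resid = fit
        if best is None or resid < best[3]:
            best = (lag, scale, shift, resid)
    return best


# Toy profiles (hypothetical): from t = 2 onward, y[t] = 2 * x[t - 2] + 1.
gene_x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0]
gene_y = [9.0, 4.0, 3.0, 7.0, 5.0, 11.0, 9.0, 13.0]

result = best_lag(gene_x, gene_y, max_lag=3, min_len=4)
print(result)
```

On this toy input the sketch selects a lag of 2 with scale 2 and shift 1. A full 3D method would additionally need to search over gene subsets and sample subspaces and to apply the regulation and coherence thresholds, none of which is attempted here.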
