Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects

Author(s): Zhang, Yafeng; Horvath, Steve; Ophoff, Roel; Telesca, Donatello

Abstract: Time course microarray data provide insight into dynamic biological processes. While several clustering methods have been proposed for the analysis of these data structures, the comparison and selection of appropriate clustering methods are seldom discussed. We compared three probabilistic clustering methods and three distance-based clustering methods for time course microarray data. Among probabilistic methods, we considered smoothing spline clustering (SSC), also known as model-based functional data analysis (MFDA); functional clustering models for sparsely sampled data (FCM); and model-based clustering (MCLUST). Among distance-based methods, we considered weighted gene co-expression network analysis (WGCNA), clustering with dynamic time warping distance (DTW), and clustering with autocorrelation-based distance (ACF). We studied these algorithms in both simulated settings and case study data. Our investigations showed that FCM performed very well when gene curves were short and sparse. DTW and WGCNA performed well when gene curves were medium or long (at least 10 observations). SSC performed very well when there were clusters of gene curves similar to one another. Overall, ACF performed poorly in these applications. In terms of computation time, FCM, SSC and DTW were considerably slower than MCLUST and WGCNA. WGCNA outperformed MCLUST by generating more accurate and biologically meaningful clustering results. When performance and computation time are both taken into account, WGCNA and MCLUST are the best of the six methods compared. WGCNA outperforms MCLUST, but MCLUST provides model-based inference and uncertainty measures for clustering results.
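As an illustration of one of the distance-based approaches named in the abstract, the following is a minimal sketch of a textbook dynamic time warping (DTW) distance between two expression curves. It is not the authors' implementation: the function name `dtw_distance`, the absolute-difference local cost, and the absence of any warping-window constraint are all assumptions made for this example.

```python
def dtw_distance(x, y):
    """Classic O(n*m) dynamic time warping distance between two sequences.

    Assumptions (not from the paper): local cost is |x_i - y_j| and the
    warping path is unconstrained (no window or slope restriction).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # repeat y[j-1]
                                 cost[i][j - 1],      # repeat x[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Two hypothetical gene curves with the same shape, one stretched in time
curve_a = [0.0, 1.0, 1.0, 2.0]
curve_b = [0.0, 1.0, 2.0]
print(dtw_distance(curve_a, curve_b))
```

A pairwise matrix of such distances could then be passed to any standard clustering routine, for example hierarchical clustering; which routine the paper pairs with DTW is not stated in the abstract.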
