Clustering short time series gene expression data

MOTIVATION Time series expression experiments are used to study a wide range of biological systems. More than 80% of all time series expression datasets are short (8 time points or fewer). These datasets present unique challenges. On account of the large number of genes profiled (often tens of thousands) and the small number of time points many patterns are expected to arise at random. Most clustering algorithms are unable to distinguish between real and random patterns. RESULTS We present an algorithm specifically designed for clustering short time series expression data. Our algorithm works by assigning genes to a predefined set of model profiles that capture the potential distinct patterns that can be expected from the experiment. We discuss how to obtain such a set of profiles and how to determine the significance of each of these profiles. Significant profiles are retained for further analysis and can be combined to form clusters. We tested our method on both simulated and real biological data. Using immune response data we show that our algorithm can correctly detect the temporal profile of relevant functional categories. Using Gene Ontology analysis we show that our algorithm outperforms both general clustering algorithms and algorithms designed specifically for clustering time series gene expression data. AVAILABILITY Information on obtaining a Java implementation with a graphical user interface (GUI) is available from http://www.cs.cmu.edu/~jernst/st/ SUPPLEMENTARY INFORMATION Available at http://www.cs.cmu.edu/~jernst/st/

[1]  Paola Sebastiani,et al.  Cluster analysis of gene expression dynamics , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[2]  S. Dudoit,et al.  STATISTICAL METHODS FOR IDENTIFYING DIFFERENTIALLY EXPRESSED GENES IN REPLICATED cDNA MICROARRAY EXPERIMENTS , 2002 .

[3]  Shyamal D. Peddada,et al.  Gene Selection and Clustering for Time-course and Dose-response Microarray Experiments Using Order-restricted Inference , 2003, Bioinform..

[4]  Kai-Florian Storch,et al.  Extensive and divergent circadian gene expression in liver and heart , 2002, Nature.

[5]  L. P. Zhao,et al.  Statistical modeling of large microarray data sets to identify stimulus-response profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Nicola J. Rinaldi,et al.  Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle , 2001, Cell.

[7]  J. Mesirov,et al.  Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Zhaohui S. Qin,et al.  Statistical resynchronization and Bayesian detection of periodically expressed genes. , 2004, Nucleic acids research.

[9]  Joshua M. Korn,et al.  The plasticity of dendritic cell responses to pathogens and their components. , 2001, Science.

[10]  B. S. Baker,et al.  Gene Expression During the Life Cycle of Drosophila melanogaster , 2002, Science.

[11]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[12]  David Botstein,et al.  The Stanford Microarray Database: data access and quality assessment tools , 2003, Nucleic Acids Res..

[13]  Dorit S. Hochbaum,et al.  Approximation Algorithms for NP-Hard Problems , 1996 .

[14]  Roded Sharan,et al.  Algorithmic approaches to clustering gene expression data , 2001 .

[15]  Tommi S. Jaakkola,et al.  Continuous Representations of Time-Series Gene Expression Data , 2003, J. Comput. Biol..

[16]  Chen-An Tsai,et al.  Testing for differentially expressed genes with microarray data. , 2003, Nucleic acids research.

[17]  Giovanna Lucchini,et al.  The Plasticity of Dendritic Cell Responses to Pathogens and Their Components , 2001 .

[18]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[19]  Alexander Schliep,et al.  Using hidden Markov models to analyze gene expression time course data , 2003, ISMB.

[20]  P. Brown,et al.  A specific gene expression program triggered by Gram-positive bacteria in the cytosol. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Kwang-Hyun Cho,et al.  Microarray data clustering based on temporal variation: FCV with TSD preclustering. , 2003, Applied bioinformatics.

[22]  E. Salmon Gene Expression During the Life Cycle of Drosophila melanogaster , 2002 .

[23]  S. Falkow,et al.  Cag pathogenicity island-specific responses of gastric epithelial cells to Helicobacter pylori infection , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Dorit S. Hochba,et al.  Approximation Algorithms for NP-Hard Problems , 1997, SIGA.

[25]  Satoru Miyano,et al.  Statistical analysis of a small set of time-ordered gene expression data using linear splines , 2002, Bioinform..