Partial mixture model for tight clustering of gene expression time-course

Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, little work in the literature is dedicated to this area of research. Moreover, maximum likelihood techniques have been used extensively for model parameter estimation, whereas the minimum distance estimator has been largely ignored.

Results: In this paper we show that the inherent robustness of the minimum distance estimator makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, we formulate a partial mixture model that naturally incorporates replicate information and allows for scattered genes. We provide experimental results of simulated data fitting, in which the minimum distance estimator outperforms the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores highest in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.

Conclusion: For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm demonstrates that the combination of the partial mixture model and the minimum distance estimator is well suited to this field. We show that tight clustering is not only capable of generating a deeper understanding of the dataset under study, in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of the clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.
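The robustness claim can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the minimum distance criterion is the integrated square error (one common choice, not stated in the abstract), and it fits a single weighted Gaussian component w * N(mu, sigma^2) to one-dimensional toy data instead of a regression model on time-course profiles. The function name `l2e_partial_fit`, the grid ranges, and the simulated data are all hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def l2e_partial_fit(data, mus, sigmas):
    """Grid-search fit of a partial component w * N(mu, sigma^2).

    Assumed criterion (integrated square error, up to a constant):
        w^2 * a - 2 * w * b
    where a is the integral of the squared normal density and b is the
    average density of the data under N(mu, sigma^2).
    """
    best = None
    for sigma in sigmas:
        a = 1.0 / (2.0 * sigma * math.sqrt(math.pi))
        for mu in mus:
            b = sum(normal_pdf(x, mu, sigma) for x in data) / len(data)
            w = min(1.0, b / a)  # minimiser of the quadratic in w, capped at 1
            crit = w * w * a - 2.0 * w * b
            if best is None or crit < best[0]:
                best = (crit, w, mu, sigma)
    return best[1], best[2], best[3]


# Toy data (hypothetical): one tight cluster plus scattered points.
rng = random.Random(0)
inliers = [rng.gauss(0.0, 1.0) for _ in range(160)]
scattered = [rng.uniform(6.0, 10.0) for _ in range(40)]
data = inliers + scattered

# Maximum likelihood mean of a single Gaussian is the sample mean,
# which the scattered points pull away from the cluster centre.
mle_mean = sum(data) / len(data)

mus = [-3.0 + 0.1 * i for i in range(131)]
sigmas = [0.3 + 0.1 * j for j in range(28)]
w, mu, sigma = l2e_partial_fit(data, mus, sigmas)

print("MLE mean:", round(mle_mean, 2))
print("Minimum distance fit: w =", round(w, 2), "mu =", round(mu, 2),
      "sigma =", round(sigma, 2))
```

In this sketch the weight w is free to be smaller than one, so the fitted component can describe only the tight cluster and leave the scattered points unexplained, which is the behaviour the partial mixture model relies on.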
