Accounting for noise when clustering biological data

Clustering is a powerful and commonly used technique that organizes and elucidates the structure of biological data. Clustering data from gene expression, metabolomics and proteomics experiments has proven useful for deriving a variety of insights, such as the shared regulation or function of biochemical components within networks. However, experimental measurements of biological processes are subject to substantial noise, stemming from both technical and biological variability, and most clustering algorithms are sensitive to this noise. In this article, we explore several methods of accounting for noise when analyzing biological data sets through clustering. Using a toy data set and two case studies (gene expression and protein phosphorylation), we demonstrate the sensitivity of clustering algorithms to noise. We then show how several methods of accounting for this noise can be used to establish when clustering results can be trusted. These methods span a range of assumptions about the statistical properties of the noise and can therefore be applied to virtually any biological data source.
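As an illustration of how noise sensitivity might be probed, the following minimal sketch perturbs a toy data set with resampled noise, reclusters each perturbed copy, and records how often each pair of points lands in the same cluster. It is not the procedure used in the article: the choice of k-means, the assumption of Gaussian noise with a known standard deviation, and the names `kmeans`, `cooccurrence` and `noise_sd` are all assumptions made for this example.

```python
import random


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means on a list of tuples; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def cooccurrence(points, k, noise_sd, n_resamples=100, seed=0):
    """Fraction of noisy reclusterings in which each pair shares a cluster.

    Assumes (hypothetically) independent Gaussian noise with standard
    deviation noise_sd on every measurement.
    """
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        noisy = [tuple(x + rng.gauss(0, noise_sd) for x in p) for p in points]
        labels = kmeans(noisy, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_resamples for c in row] for row in counts]


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
co = cooccurrence(points, k=2, noise_sd=0.1)
```

Entries of `co` near 1 or 0 indicate pairings that are stable under the assumed noise. Intermediate values flag assignments that depend on the particular noise realization and should be interpreted with caution. Raising `noise_sd` relative to the separation between groups is one way to explore when the clustering stops being trustworthy.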
