Evaluation of Normalization and Pre-Clustering Issues in a Novel Clustering Approach: Global Optimum Search with Enhanced Positioning

We study how different normalization and pre-clustering techniques affect clustering quality for a novel mixed-integer nonlinear optimization-based clustering algorithm, the Global Optimum Search with Enhanced Positioning (EP_GOS_Clust). Both issues matter in practice. DNA microarray experiments are informative tools for elucidating gene regulatory networks, but gene expression levels are comparable across microarrays only if normalization is carried out properly. The aim of pre-clustering is to use an adequate number of discriminatory characteristics to form rough information profiles, so that data with similar features can be pre-grouped and outliers deemed insignificant to the clustering process can be removed. Using experimental DNA microarray data from the yeast Saccharomyces cerevisiae, we study the merits of pre-clustering genes based on distance/correlation comparisons and on symbolic representations such as {+, o, -}. As performance metrics, we use the intra- and inter-cluster error sums, two generic but intuitive measures of clustering quality. We also use publicly available Gene Ontology resources to assess the biological coherence of the clusters. Our analysis indicates that normalization and pre-clustering methods significantly affect the clustering results. The outcome of this study is therefore useful for fine-tuning the EP_GOS_Clust clustering approach.
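The abstract does not define the normalization, the symbolic encoding, or the error sums, so the following is only a minimal sketch under stated assumptions. It assumes z-score normalization of each gene profile, a {+, o, -} encoding of the change between consecutive time points with a hypothetical tolerance `tol`, an intra-cluster error sum taken as the summed squared distance of each profile to its cluster center, and an inter-cluster error sum taken as the summed squared distance of each cluster center to the global center. All function names and the gene labels `g1` to `g3` are illustrative and are not taken from EP_GOS_Clust.

```python
from collections import defaultdict
from math import sqrt


def zscore(profile):
    """Scale one profile to zero mean and unit (population) standard deviation."""
    n = len(profile)
    mean = sum(profile) / n
    std = sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if std == 0:
        return [0.0] * n  # flat profile: no variation to scale
    return [(x - mean) / std for x in profile]


def symbolic_profile(profile, tol=0.5):
    """Encode each consecutive change as '+', 'o', or '-' (tol is an assumed threshold)."""
    symbols = []
    for prev, curr in zip(profile, profile[1:]):
        delta = curr - prev
        if delta > tol:
            symbols.append("+")
        elif delta < -tol:
            symbols.append("-")
        else:
            symbols.append("o")
    return "".join(symbols)


def precluster(profiles, tol=0.5):
    """Group gene names that share the same symbolic profile."""
    groups = defaultdict(list)
    for name, profile in profiles.items():
        groups[symbolic_profile(profile, tol)].append(name)
    return dict(groups)


def centroid(rows):
    """Component-wise mean of a list of equal-length profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def error_sums(profiles, clusters):
    """Return (intra, inter) error sums under the assumed definitions above."""
    global_center = centroid(list(profiles.values()))
    intra = 0.0
    inter = 0.0
    for members in clusters.values():
        rows = [profiles[name] for name in members]
        center = centroid(rows)
        intra += sum(sq_dist(row, center) for row in rows)
        inter += sq_dist(center, global_center)
    return intra, inter


# Toy data: g1 and g2 rise at different scales, g3 falls.
raw = {"g1": [0.0, 1.0, 2.0], "g2": [0.0, 2.0, 4.0], "g3": [2.0, 1.0, 0.0]}
normalized = {name: zscore(p) for name, p in raw.items()}
groups = precluster(normalized)          # g1 and g2 share "++", g3 gets "--"
intra, inter = error_sums(normalized, groups)
```

In this toy case, normalization makes g1 and g2 identical, so they fall into the same pre-cluster and contribute nothing to the intra-cluster error sum. On the raw values the same grouping would show a nonzero intra-cluster error, which illustrates why the choice of normalization can change the measured clustering quality.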
