A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced positioning

Cluster analysis of genome-wide expression data from DNA microarray hybridization studies is a useful tool for identifying biologically relevant gene groupings (DeRisi et al. 1997; Weiler et al. 1997). It is therefore important to apply a rigorous yet intuitive clustering algorithm to uncover these genomic relationships. In this study, we describe a novel clustering algorithm framework based on a variant of the Generalized Benders Decomposition, denoted as the Global Optimum Search (Floudas et al. 1989; Floudas 1995), which includes a procedure to determine the optimal number of clusters to be used. The approach involves a pre-clustering of data points to define an initial number of clusters, followed by the iterative solution of a Linear Programming problem (the primal problem) and a Mixed-Integer Linear Programming problem (the master problem), both of which are derived from a Mixed-Integer Nonlinear Programming problem formulation. Badly placed data points are removed to form new clusters, which ensures tight groupings amongst the data points and increments the number of clusters until the optimum number is reached. We apply the proposed clustering algorithm to experimental DNA microarray data centered on the Ras signaling pathway in the yeast Saccharomyces cerevisiae and compare the results to those obtained with some commonly used clustering algorithms. Our algorithm compares favorably against these algorithms in terms of intra-cluster similarity and inter-cluster dissimilarity, often considered two key tenets of clustering. Furthermore, our algorithm can predict the optimal number of clusters, and the biological coherence of the predicted clusters is analyzed through gene ontology.
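
The two quality criteria named above, intra-cluster similarity and inter-cluster dissimilarity, can be illustrated with a minimal Python sketch. It is not the authors' formulation: it assumes squared Euclidean distance to cluster centroids as the measure, and the function names (`intra_cluster_error`, `inter_cluster_error`, `worst_placed`) are hypothetical. The `worst_placed` helper only hints at the idea of flagging a badly placed data point; the actual removal criterion is not given in the abstract.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intra_cluster_error(clusters):
    """Sum of squared distances from each point to its own centroid.

    Smaller values mean tighter clusters (assumed measure, not from the paper).
    """
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sq_dist(p, c) for p in pts)
    return total


def inter_cluster_error(clusters):
    """Sum of squared distances between every pair of cluster centroids.

    Larger values mean better separated clusters (assumed measure).
    """
    cents = [centroid(pts) for pts in clusters]
    return sum(
        sq_dist(cents[i], cents[j])
        for i in range(len(cents))
        for j in range(i + 1, len(cents))
    )


def worst_placed(clusters):
    """Return (cluster index, point) for the point farthest from its centroid.

    Hypothetical stand-in for detecting a badly placed data point.
    """
    best = None
    for idx, pts in enumerate(clusters):
        c = centroid(pts)
        for p in pts:
            d = sq_dist(p, c)
            if best is None or d > best[0]:
                best = (d, idx, p)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy two-dimensional data; real expression profiles have many dimensions.
    clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2), (10, 10)]]
    print(intra_cluster_error(clusters))
    print(inter_cluster_error(clusters))
    print(worst_placed(clusters))
```

In this toy data the point `(10, 10)` lies far from the rest of its cluster, so it is the one a removal step of this kind would flag as the seed of a new cluster.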

[1]  Alexander Schliep,et al.  ProClust: improved clustering of protein sequences with an extended graph-based approach , 2002, ECCB.

[2]  Robert R. Sokal,et al.  A statistical method for evaluating systematic relationships , 1958 .

[3]  J. Hoheisel,et al.  Hybridisation based DNA screening on peptide nucleic acid (PNA) oligomer arrays. , 1997, Nucleic acids research.

[4]  Umeshwar Dayal,et al.  K-Harmonic Means - A Data Clustering Algorithm , 1999 .

[5]  C. Floudas Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications , 1995 .

[6]  W. Bialek,et al.  Information-based clustering. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[7]  B. S. Duran,et al.  Cluster Analysis: A Survey , 1974 .

[8]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[9]  J. C. Dunn,et al.  A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters , 1973 .

[10]  C. Floudas,et al.  Synthesis of general distillation sequences : nonsharp separations , 1990 .

[11]  Robert E. Johnson The Role of Cluster Analysis in Assessing Comparability under the U.S. Transfer Pricing Regulations , 2001 .

[12]  L. A. Goodman,et al.  Measures of association for cross classifications , 1979 .

[13]  Pierre Hansen,et al.  Cluster analysis and mathematical programming , 1997, Math. Program..

[14]  Charles T. Zahn,et al.  Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters , 1971, IEEE Transactions on Computers.

[15]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[16]  Naftali Tishby,et al.  The information bottleneck method , 2000, ArXiv.

[17]  Christodoulos A. Floudas,et al.  APROS: Algorithmic Development Methodology for Discrete-Continuous Optimization Problems , 1989, Oper. Res..

[18]  A. Owen,et al.  A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae) , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[19]  P. Brown,et al.  Exploring the metabolic and genetic control of gene expression on a genomic scale. , 1997, Science.

[20]  Inderjit S. Dhillon,et al.  Information theoretic clustering of sparse cooccurrence data , 2003, Third IEEE International Conference on Data Mining.

[21]  Richard M. Leahy,et al.  An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Application to Image Segmentation , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[22]  R. Tibshirani,et al.  Repeated observation of breast tumor subtypes in independent gene expression data sets , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Laurie J. Heyer,et al.  Exploring expression data: identification and analysis of coexpressed genes. , 1999, Genome research.

[24]  Hanif D. Sherali,et al.  A Global Optimization RLT-based Approach for Solving the Fuzzy Clustering Problem , 2005, J. Glob. Optim..

[25]  Ying Wang,et al.  Theoretical and computational studies of the glucose signaling pathways in yeast using global gene expression data , 2003, Biotechnology and bioengineering.

[26]  Christodoulos A. Floudas,et al.  Deterministic Global Optimization: Theory, Methods and Applications (Nonconvex Optimization and Its Applications, Volume 37) , 2005 .

[27]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[28]  Saeed Tavazoie,et al.  Ras and Gpa2 Mediate One Branch of a Redundant Glucose Signaling Pathway in Yeast , 2004, PLoS biology.

[29]  Enrique H. Ruspini,et al.  A New Approach to Clustering , 1969, Inf. Control..

[30]  Bin Zhang Generalized K-Harmonic Means -- Boosting in Unsupervised Learning , 2000 .

[31]  David Kendrick,et al.  GAMS, a user's guide , 1988, SGNM.

[32]  L. Hubert,et al.  Quadratic assignment as a general data analysis strategy. , 1976 .

[33]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[34]  Bin Zhang,et al.  Generalized K-Harmonic Means -- Boosting in Unsupervised Learning , 2000 .

[35]  Maria Halkidi,et al.  Cluster validity methods , 2002 .

[36]  N. Metropolis,et al.  Equation of State Calculations by Fast Computing Machines , 1953, Resonance.

[37]  Christodoulos A. Floudas,et al.  Synthesis of flexible heat exchanger networks with uncertain flowrates and temperatures , 1987 .

[38]  Alfonso Valencia,et al.  A hierarchical unsupervised growing neural network for clustering gene expression patterns , 2001, Bioinform..

[39]  Friedrich Leisch,et al.  Competitive Learning for Binary Valued Data , 1998 .

[40]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[41]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[42]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[43]  J. Gower,et al.  Minimum Spanning Trees and Single Linkage Cluster Analysis , 1969 .

[44]  Lei Guo,et al.  Predicting Gene Expression from Sequence: A Reexamination , 2007, PLoS Comput. Biol..

[45]  C. Floudas,et al.  Global optimum search for nonconvex NLP and MINLP problems , 1989 .

[46]  Ding-Zhu Du,et al.  A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering , 2003, J. Glob. Optim..

[47]  Teuvo Kohonen,et al.  Self-organization and associative memory: 3rd edition , 1989 .

[48]  Rainer Fuchs,et al.  Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters , 2001, Bioinform..

[49]  A. M. Geoffrion Generalized Benders decomposition , 1972 .

[50]  P. Jaccard The Distribution of the Flora in the Alpine Zone , 1912 .

[51]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[52]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Eric J. Pauwels,et al.  Finding Salient Regions in Images: Nonparametric Clustering for Image Segmentation and Grouping , 1999, Comput. Vis. Image Underst..

[54]  Hanif D. Sherali,et al.  Linearization Strategies for a Class of Zero-One Mixed Integer Programming Problems , 1990, Oper. Res..

[55]  J. Dunn Well-Separated Clusters and Optimal Fuzzy Partitions , 1974 .

[56]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[57]  Hanif D. Sherali,et al.  A Global Optimization RLT-based Approach for Solving the Hard Clustering Problem , 2005, J. Glob. Optim..

[58]  Christodoulos A. Floudas,et al.  Optimization of complex reactor networks—I. Isothermal operation , 1990 .

[59]  J. Claverie Computational methods for the identification of differential and coordinated gene expression. , 1999, Human molecular genetics.

[60]  Rui Xu,et al.  Survey of clustering algorithms , 2005, IEEE Transactions on Neural Networks.

[61]  Christodoulos A. Floudas,et al.  A retrofit approach for heat exchanger networks , 1989 .

[62]  Michalis Vazirgiannis,et al.  Cluster validity methods: part I , 2002, SGMD.

[63]  Christodoulos A. Floudas,et al.  Global optimization in the 21st century: Advances and challenges , 2005, Comput. Chem. Eng..

[64]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[65]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[66]  Nikos A. Vlassis,et al.  The global k-means clustering algorithm , 2003, Pattern Recognit..

[67]  Stephen Grossberg,et al.  ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures , 1990, Neural Networks.

[68]  Christodoulos A. Floudas,et al.  Synthesis of distillation sequences with several multicomponent feed and product streams , 1988 .

[69]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[70]  Lisa Schneper,et al.  Sense and sensibility: nutritional response and signal integration in yeast. , 2004, Current opinion in microbiology.