Robust Centroid-Based Clustering using Derivatives of Pearson Correlation

Modern high-throughput facilities provide the basis of -omics research by delivering extensive biomedical data sets. Mass spectra, multi-channel chromatograms, and cDNA arrays are examples of such data sources, for which accurate analysis is desired. Centroid-based clustering provides a helpful data abstraction by representing sets of similar data vectors with characteristic prototypes placed in high-density regions of the data space. In this way, specific modes can be detected, for example in gene expression profiles or in lists of protein and metabolite abundances. Despite their widespread use, k-means and self-organizing maps (SOM) often produce suboptimal centroids: the final clusters depend strongly on the initialization, and the data are not quantized as accurately as possible, particularly if a measure other than the Euclidean distance is chosen for data comparison. Neural gas (NG) is a mathematically rigorous clustering method that optimizes the centroid positions by minimizing their quantization errors. Originally formulated for the Euclidean distance, NG is generalized in this work to give accurate and robust results for the Pearson correlation similarity measure. The benefits of the new NG for correlation (NG-C) are demonstrated on gene expression data and mass spectra.
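To make the idea of correlation-driven prototype updates concrete, the following is a minimal sketch of an online neural gas loop that ranks prototypes by Pearson correlation and moves them along the derivative of the correlation. It is an illustration under stated assumptions, not the NG-C algorithm as specified by the authors: the function names `standardize` and `neural_gas_corr`, the parameters `lr0`, `lr1`, `epochs`, and `seed`, the exponential annealing of learning rate and neighborhood range, and the renormalization of prototypes after each step are all choices made here for readability. The sketch relies on the fact that, for centered unit-length vectors x and w, the correlation is r = x·w and its derivative with respect to w is x − r·w.

```python
import numpy as np

def standardize(V):
    """Center each row and scale it to unit length.
    Pearson correlation is invariant to both operations."""
    V = V - V.mean(axis=-1, keepdims=True)
    return V / np.linalg.norm(V, axis=-1, keepdims=True)

def neural_gas_corr(X, k, epochs=30, lr0=0.5, lr1=0.01, seed=0):
    """Illustrative online neural gas with Pearson correlation (assumed design).
    Rows of X must be non-constant. Returns k standardized prototypes."""
    rng = np.random.default_rng(seed)
    Z = standardize(np.asarray(X, dtype=float))
    # Initialize prototypes on randomly chosen data vectors.
    W = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    steps, t = epochs * len(Z), 0
    lam0, lam1 = k / 2.0, 0.01  # neighborhood range, annealed toward zero
    for _ in range(epochs):
        for i in rng.permutation(len(Z)):
            frac = t / steps
            lr = lr0 * (lr1 / lr0) ** frac
            lam = lam0 * (lam1 / lam0) ** frac
            r = W @ Z[i]                        # correlations with all prototypes
            ranks = np.argsort(np.argsort(-r))  # rank 0 = most correlated
            h = np.exp(-ranks / lam)            # rank-based neighborhood weights
            grad = Z[i] - r[:, None] * W        # d r / d w for standardized w
            # Gradient ascent on correlation, then renormalize (assumed step).
            W = standardize(W + (lr * h)[:, None] * grad)
            t += 1
    return W
```

Calling `neural_gas_corr(X, k)` on a matrix whose rows are expression profiles or spectra returns `k` prototype profiles; each data vector can then be assigned to the prototype with which it correlates most strongly. Because every prototype is updated with a weight that depends only on its rank, the result should be less sensitive to the initial prototype positions than a pure winner-takes-all update.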
