Intuitive Clustering of Biological Data

K-means clustering is widely used in applications because it combines several attractive properties: training is intuitive and simple, the final classifier represents classes by geometrically meaningful prototypes, and the algorithm is competitive with more complex clustering algorithms. In this contribution, we focus on extensions that incorporate additional information into the clustering algorithm to improve accuracy: neighborhood cooperation as in neural gas, (possibly fuzzy) label information for the input data, and general problem-adapted distances in place of the standard Euclidean metric. These extensions can be formulated in a simple general framework by means of a cost function. We demonstrate the capability of these variants on several representative clustering problems from computational biology.
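To make the first of these extensions concrete, the following is a minimal sketch of batch clustering with neural-gas-style neighborhood cooperation, and it is not the authors' implementation. It assumes the squared Euclidean distance, a rank-based weight exp(-rank / lam), and an exponentially decreasing neighborhood range; the function names, the annealing schedule, and the default parameters are illustrative assumptions. Label information and general distances are left out. As the range lam approaches zero, only the closest prototype is updated by each point, and the update reduces to the standard k-means step.

```python
import numpy as np

def batch_neural_gas(X, n_prototypes, n_epochs=50, lam_end=0.01, seed=0):
    """Sketch of batch neural gas with squared Euclidean distance.

    All names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize prototypes at randomly chosen data points.
    W = X[rng.choice(len(X), size=n_prototypes, replace=False)].copy()
    lam_start = n_prototypes / 2.0  # assumed initial neighborhood range
    for t in range(n_epochs):
        # Anneal the neighborhood range from lam_start down to lam_end.
        frac = t / max(n_epochs - 1, 1)
        lam = lam_start * (lam_end / lam_start) ** frac
        # Squared distances between every point and every prototype: (n, k).
        d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        # Rank of each prototype per point (0 = closest).
        ranks = d.argsort(axis=1).argsort(axis=1)
        # Neighborhood cooperation: closer-ranked prototypes get more weight.
        h = np.exp(-ranks / lam)
        denom = h.sum(axis=0)
        # Keep the old prototype if its total weight underflows to zero.
        ok = denom > 1e-300
        W[ok] = (h.T @ X)[ok] / denom[ok, None]
    return W

def assign(X, W):
    """Assign each point to its closest prototype (squared Euclidean)."""
    X = np.asarray(X, dtype=float)
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

A call such as `W = batch_neural_gas(X, n_prototypes=5)` followed by `labels = assign(X, W)` returns the prototypes and the cluster index of each row of `X`. The weighted-mean update used here is tied to the squared Euclidean distance; a problem-adapted distance would generally need its own prototype update.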
