A SOM prototype-based cluster analysis methodology

An original computational approach for cluster analysis is proposed. The method consists of two phases, both based on the Self-Organizing Map. Topology-preserving and connectivity functions are used in the clustering process. The method is validated on three benchmark datasets and a real biological dataset. Automated parameterization results in a user-friendly methodology.

Data clustering aims to find groups of data that share common hidden properties. Such techniques are especially critical at the early stages of data analysis, when no information about the dataset is available. One of the major shortcomings of clustering algorithms is the difficulty non-expert users have in configuring them and, in some cases, interpreting the results. In this work, a computational approach with a two-layer structure based on the Self-Organizing Map (SOM) is presented for cluster analysis. In the first level, the data samples are quantized, and topology-preserving metrics are used to determine the number of units in the SOM automatically. In the second level, the resulting SOM prototypes are clustered by means of a connectivity analysis, which explores the quality of the partitioning for different numbers of clusters. The most important benefit of this two-layer procedure is that the computational load decreases considerably compared with clustering methods that operate directly on the data, making it possible to cluster large datasets and to consider several clustering alternatives in a limited time. The methodology produces a two-dimensional map representation of the usually high-dimensional input space, along with quantitative information on viable clustering alternatives, which facilitates the exploration of the possible partitions in a dataset. The efficiency and interpretability of the methodology are illustrated by its application to artificial, benchmark, and real complex biological datasets.
The experimental results demonstrate the ability of the method to identify possible segmentations in a dataset, in contrast to algorithms that yield only a single clustering solution. The proposed algorithm tackles the intrinsic limitations of SOM and the parameter settings associated with the clustering methodology: among other things, it requires neither the number of clusters nor the SOM architecture to be specified in advance. This makes the method applicable even by researchers with limited expertise in machine learning.
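
The two-level idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`train_som`, `connectivity_matrix`, `cluster_prototypes`), the fixed map size, the learning-rate and neighbourhood schedules, and the `min_links` threshold are all assumptions made for illustration. In particular, the map size is passed in by hand rather than selected with a topology-preservation metric, and the second level simply takes connected components of a graph linking each sample's two nearest prototypes, a simplified stand-in for the connectivity analysis mentioned in the abstract.

```python
import numpy as np

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Level 1 (assumed form): train a small rectangular SOM.

    Returns the prototypes, shape (rows * cols, dim).
    """
    rng = np.random.default_rng(seed)
    # Grid coordinates of each unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise prototypes from randomly chosen samples.
    protos = data[rng.choice(len(data), rows * cols, replace=True)].astype(float)
    sigma0 = max(rows, cols) / 2.0
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5      # shrinking neighbourhood radius
            bmu = np.argmin(((protos - x) ** 2).sum(axis=1))
            h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            protos += lr * h[:, None] * (x - protos)
            step += 1
    return protos

def connectivity_matrix(data, protos):
    """conn[i, j] counts samples whose two nearest prototypes are i and j."""
    d = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :2]
    conn = np.zeros((len(protos), len(protos)), dtype=int)
    for i, j in nearest:
        conn[i, j] += 1
        conn[j, i] += 1
    return conn

def cluster_prototypes(conn, min_links=1):
    """Level 2 (assumed form): connected components of the thresholded graph."""
    n = len(conn)
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            # Follow every link that is at least `min_links` strong.
            for v in np.nonzero(conn[u] >= min_links)[0]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic, well-separated blobs (illustrative data only).
    data = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
    protos = train_som(data, rows=4, cols=4)
    conn = connectivity_matrix(data, protos)
    for min_links in (1, 3, 5):
        labels = cluster_prototypes(conn, min_links)
        print(min_links, len(set(labels.tolist())))
```

Sweeping `min_links` yields several candidate partitions of the prototypes from a single trained map, which mirrors the idea of exploring clustering alternatives cheaply at the prototype level. Note that in this sketch a unit with no sufficiently strong link ends up as its own one-unit cluster, so a practical version would need some rule for such units.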
