Fast Algorithm and Implementation of Dissimilarity Self-Organizing Maps

In many real-world applications, data cannot be accurately represented by vectors. In those situations, one possible solution is to rely on dissimilarity measures that enable a sensible comparison between observations. Kohonen's self-organizing map (SOM) has been adapted to data described only through their dissimilarity matrix. This algorithm provides both nonlinear projection and clustering of nonvector data. Unfortunately, its high computational cost makes it difficult to use with large data sets. In this paper, we propose a new algorithm that substantially reduces the theoretical cost of the dissimilarity SOM without changing its outcome (the results are exactly the same as those obtained with the original algorithm). Moreover, we introduce implementation methods that result in very short running times. The improvements predicted by the theoretical cost model are validated on simulated and real-world data (a word list clustering problem). We also demonstrate that the proposed implementation methods reduce the running time of the fast algorithm by a factor of up to three over a standard implementation.
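To make the setting concrete, here is a minimal sketch of one batch iteration of a dissimilarity SOM, assuming the usual formulation in which each prototype is constrained to be one of the observations and is chosen to minimize a neighborhood-weighted sum of dissimilarities. The function names, the Gaussian kernel, and the grid layout are illustrative assumptions, not taken from the paper. `update_naive` evaluates every candidate for every unit directly, which as written costs on the order of N²M operations for N observations and M units. `update_grouped` shows one way to obtain the same prototypes more cheaply by first summing dissimilarities per cluster; it is offered as an illustration of how an exact speed-up is possible and is not necessarily the authors' algorithm or implementation.

```python
import math

def grid_kernel(rows, cols, radius):
    """Gaussian neighborhood weights between all pairs of units on a grid.
    The Gaussian form and the radius parameter are assumptions of this sketch."""
    pos = [(r, c) for r in range(rows) for c in range(cols)]
    return [[math.exp(-((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
                      / (2.0 * radius ** 2)) for b in pos] for a in pos]

def assign(D, protos):
    """Assignment step: for each observation i, the unit whose prototype
    (an observation index) is closest according to the matrix D."""
    return [min(range(len(protos)), key=lambda j: D[i][protos[j]])
            for i in range(len(D))]

def update_naive(D, bmu, H):
    """Representation step, direct version: for each unit j, pick the
    observation k minimizing sum_i H[bmu[i]][j] * D[i][k]."""
    n, m = len(D), len(H)
    protos = []
    for j in range(m):
        cost = [sum(H[bmu[i]][j] * D[i][k] for i in range(n))
                for k in range(n)]
        protos.append(min(range(n), key=cost.__getitem__))
    return protos

def update_grouped(D, bmu, H):
    """Same step using per-cluster partial sums S[l][k] = sum of D[i][k]
    over observations i assigned to unit l (hypothetical helper layout)."""
    n, m = len(D), len(H)
    S = [[0.0] * n for _ in range(m)]
    for i in range(n):
        row = S[bmu[i]]
        for k in range(n):
            row[k] += D[i][k]
    protos = []
    for j in range(m):
        cost = [sum(H[l][j] * S[l][k] for l in range(m)) for k in range(n)]
        protos.append(min(range(n), key=cost.__getitem__))
    return protos
```

A full run would alternate `assign` and one of the update functions while shrinking the neighborhood radius. In the grouped version the partial sums take about N² additions and the candidate costs about NM² multiplications, so the gain grows with the ratio of observations to units. The two versions compute the same sums in a different order, so with floating-point weights they can differ on near-ties.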
