Efficient interpretable variants of online SOM for large dissimilarity data

Self-organizing maps (SOM) are a useful tool for exploring data. In its original version, the SOM algorithm was designed for numerical vectors. Since then, several extensions have been proposed to handle complex datasets described by (dis)similarities. Most of these extensions represent each prototype by a list of (dis)similarities with the entire dataset and suffer from several drawbacks: their complexity becomes quadratic instead of linear, their stability is reduced, and the interpretability of the prototypes is lost.

In the present article, we propose and compare two extensions of the stochastic SOM for (dis)similarity data: the first takes advantage of the online setting to maintain a sparse representation of the prototypes at each step of the algorithm, while the second uses dimension reduction in a feature space defined by the (dis)similarity. Our contributions to the analysis of (dis)similarity data with topographic maps are thus twofold. First, we present a new version of the SOM algorithm that ensures a sparse representation of the prototypes through online updates. Second, this approach is compared on several benchmarks to a standard dimension reduction technique (K-PCA), which is itself adapted to large datasets with the Nyström approximation.

Results demonstrate that both approaches reduce the dimensionality of the prototypes while providing accurate results in a reasonable computational time. The choice between the two strategies depends on the dataset size, the need to easily interpret the results, and the computational facilities available. The conclusion provides some recommendations to help the user make this choice.
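To make the first strategy more concrete, the following is a minimal sketch of an online relational SOM step with sparse prototypes. It is not the authors' implementation. It assumes each prototype is a convex combination of the observations (one coefficient vector per map unit), that `D` holds squared dissimilarities, and that the neighbourhood is Gaussian. The function names, the Gaussian neighbourhood, and the "keep the largest coefficients" sparsification rule are illustrative assumptions and may differ from the rule used in the article.

```python
import numpy as np

def relational_distances(D, beta):
    """Dissimilarity between every observation and one prototype.

    D    : (n, n) symmetric matrix of squared dissimilarities
    beta : (n,) convex coefficients of the prototype (non-negative, sum to 1)
    Uses the usual relational formula (D beta)_i - 0.5 * beta' D beta.
    """
    Db = D @ beta
    return Db - 0.5 * beta @ Db

def online_step(D, B, grid, i, lr, radius):
    """One stochastic update for observation i (modifies B in place).

    B    : (units, n) coefficient matrix, one row per prototype
    grid : (units, 2) coordinates of the units on the map
    Returns the index of the best matching unit.
    """
    # Distance from observation i to every prototype. The quadratic term
    # is what makes dense prototypes expensive: O(units * n^2) per step.
    quad = np.einsum("un,nm,um->u", B, D, B)
    dists = B @ D[:, i] - 0.5 * quad
    bmu = int(np.argmin(dists))

    # Gaussian neighbourhood on the map grid (an assumption of this sketch).
    h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2.0 * radius ** 2))

    # Move each prototype towards observation i; rows stay convex
    # as long as lr * h <= 1.
    target = np.zeros(D.shape[0])
    target[i] = 1.0
    B += lr * h[:, None] * (target - B)
    return bmu

def sparsify(B, keep):
    """Keep the `keep` largest coefficients of each prototype, renormalise.

    Hypothetical rule chosen for illustration only.
    """
    S = np.zeros_like(B)
    idx = np.argsort(B, axis=1)[:, -keep:]
    rows = np.arange(B.shape[0])[:, None]
    S[rows, idx] = B[rows, idx]
    return S / S.sum(axis=1, keepdims=True)
```

In such a scheme, calling `sparsify` every few iterations would leave each prototype described by a handful of observations, which is what keeps the representation both cheap to evaluate and easy to interpret.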
