Spark-GHSOM: Growing Hierarchical Self-Organizing Map for large scale mixed attribute datasets

Abstract. The Growing Hierarchical Self-Organizing Map (GHSOM) algorithm has proven useful for tasks such as exploratory analysis, anomaly detection and forecasting in a variety of domains, including finance and cyber-security. GHSOM is a dynamic variant of the SOM algorithm that generates a multi-level hierarchy of SOM maps based solely on the input data. However, generating this multi-level structure requires multiple iterations over the input dataset, which makes GHSOM intractable on large datasets. Moreover, the conventional GHSOM algorithm is designed to handle datasets with numeric attributes only. This is an important limitation, as most modern real-world datasets are characterized by mixed attributes, both numerical and categorical. In this work, we propose an extension of the conventional GHSOM algorithm called Spark-GHSOM, which exploits the Spark platform to process massive datasets in a distributed manner. In addition, we leverage a method known as the distance hierarchy approach to modify the optimization function of GHSOM so that it can also coherently handle mixed-attribute datasets. We evaluate the new method with respect to accuracy, scalability and descriptive power. The results obtained on different datasets demonstrate the superior predictive and descriptive capabilities of Spark-GHSOM, as well as its applicability to large-scale datasets that could not be analyzed before.
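
To give a rough intuition for how a distance hierarchy lets a SOM-style distance cover categorical attributes, the following minimal sketch measures the distance between two categorical values as the weighted path length between them in a concept tree, then combines it with squared numeric differences. It is an illustrative simplification, not the formulation used in Spark-GHSOM: the tree `HIERARCHY`, its link weights, and the helper names `path_to_root`, `hierarchy_distance` and `mixed_distance` are all hypothetical, and numeric attributes are assumed to be pre-scaled to a comparable range.

```python
import math

# Hypothetical concept tree: child -> (parent, link weight).
# The node names and weights are assumptions made for illustration only.
HIERARCHY = {
    "drink": ("any", 0.5),
    "snack": ("any", 0.5),
    "coke": ("drink", 0.5),
    "pepsi": ("drink", 0.5),
    "chips": ("snack", 0.5),
}


def path_to_root(node):
    """Map each ancestor of `node` (including itself) to its distance from `node`."""
    dist = 0.0
    path = {node: 0.0}
    while node in HIERARCHY:
        parent, weight = HIERARCHY[node]
        dist += weight
        path[parent] = dist
        node = parent
    return path


def hierarchy_distance(a, b):
    """Weighted path length between two values through their closest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    # Raises ValueError if the two values share no ancestor in the tree.
    return min(pa[n] + pb[n] for n in common)


def mixed_distance(x, y):
    """Euclidean-style distance over a record with numeric and categorical fields."""
    total = 0.0
    for xi, yi in zip(x, y):
        if isinstance(xi, str):
            d = hierarchy_distance(xi, yi)   # categorical attribute
        else:
            d = xi - yi                      # numeric attribute (assumed pre-scaled)
        total += d * d
    return math.sqrt(total)


print(hierarchy_distance("coke", "pepsi"))  # siblings under "drink"
print(hierarchy_distance("coke", "chips"))  # related only through the root
print(mixed_distance((0.2, "coke"), (0.7, "chips")))
```

In this sketch, values that share a nearby ancestor (such as the two drinks) end up closer than values related only through the root, which is the property a flat one-hot encoding cannot express.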
