Self-organizing subspace clustering for high-dimensional and multi-view data

A surge in the availability of data from multiple sources and modalities is correlated with advances in how to obtain, compress, store, transfer, and process large amounts of complex high-dimensional data. Clustering becomes harder as data dimensionality grows, because higher dimensionality reduces the discriminative power of distance metrics. Subspace clustering aims to group data drawn from a union of subspaces. Many state-of-the-art approaches exist, and we divide them into families according to the clustering method they use. We introduce a soft subspace clustering algorithm, a Self-Organizing Map (SOM) with a time-varying structure, that clusters data without prior knowledge of the number of categories or of the neural network topology; both are determined during training. The model also assigns relevances (weights) to the individual dimensions, learning how much each dimension contributes to uncovering the clusters. We validate the model on a number of real-world datasets. The algorithm performs competitively in a diverse range of contexts that involve high-dimensional data, among them data mining, gene expression, multi-view, computer vision, and text clustering problems. Extensive experiments suggest that our method often outperforms state-of-the-art approaches across all the problem types considered.
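To illustrate the idea of per-dimension relevances, the following minimal sketch shows how a relevance-weighted distance can change which prototype wins for a given sample. It is not the algorithm proposed here: the function names (`weighted_distance`, `nearest_prototype`), the fixed relevance values, and the toy data are assumptions made only for illustration, and the sketch omits the SOM's learning rules and time-varying structure.

```python
import numpy as np

def weighted_distance(x, centers, relevances):
    """Relevance-weighted Euclidean distance from x to each prototype.

    x: shape (d,); centers and relevances: shape (k, d).
    Relevances are assumed to lie in [0, 1]; a value of 0 means the
    prototype ignores that dimension entirely.
    """
    diff = centers - x
    return np.sqrt(np.sum(relevances * diff ** 2, axis=1))

def nearest_prototype(x, centers, relevances):
    """Index of the prototype with the smallest weighted distance to x."""
    return int(np.argmin(weighted_distance(x, centers, relevances)))

# Two hypothetical prototypes in 3-D. Prototype 0 treats the third
# dimension as irrelevant; prototype 1 uses all three dimensions.
centers = np.array([[0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0]])
relevances = np.array([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0]])

# A sample that matches prototype 0 in the first two dimensions but has
# a large value in the third.
x = np.array([0.1, 0.0, 5.0])

winner_weighted = nearest_prototype(x, centers, relevances)
winner_unweighted = nearest_prototype(x, centers, np.ones_like(centers))
```

With the relevances above, the sample is assigned to prototype 0, whereas the plain Euclidean distance assigns it to prototype 1 because the third dimension dominates. In a soft subspace method the relevances would be learned from the data rather than fixed by hand.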
