Nuclear Norm Clustering: a promising alternative method for clustering tasks

Clustering techniques are widely used in many applications. The goal of clustering is to identify patterns or groups of similar objects within a dataset of interest. However, many clustering methods are not robust and are sensitive to noise and outliers in real data. In this paper, we present Nuclear Norm Clustering (NNC, available at https://sourceforge.net/projects/nnc/), an algorithm that can be used in various fields as a promising alternative to the k-means clustering method. The NNC algorithm requires users to provide a data matrix M and a desired number of clusters K. We employ simulated annealing to choose an optimal label vector that minimizes the nuclear norm of the pooled within-cluster residual matrix. To evaluate the performance of the NNC algorithm, we applied it to 15 public benchmark datasets and 2 genome-wide association study (GWAS) datasets on psoriasis and compared it with other classic clustering methods. The results indicate that NNC achieves competitive performance in terms of F-score on the 15 public benchmark datasets and the 2 psoriasis GWAS datasets. NNC is therefore a promising alternative method for clustering tasks.
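The objective described above can be illustrated with a minimal Python sketch. It is not the released NNC implementation. It assumes that rows of M are samples, that the pooled within-cluster residual matrix is M with each cluster's mean row subtracted from that cluster's rows, and that annealing proposes single-label changes under a geometric cooling schedule. The function names and the parameters `n_iter`, `t0`, `cooling`, and `seed` are hypothetical choices made for illustration.

```python
import numpy as np


def residual_nuclear_norm(M, labels, K):
    """Nuclear norm of the pooled within-cluster residual matrix.

    Assumption: the residual of a row is the row minus its cluster mean.
    """
    R = np.asarray(M, dtype=float).copy()
    for k in range(K):
        mask = labels == k
        if mask.any():  # empty clusters contribute nothing
            R[mask] -= R[mask].mean(axis=0)
    # Sum of singular values of R
    return np.linalg.norm(R, ord="nuc")


def nnc_sketch(M, K, n_iter=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over label vectors (illustrative only, needs K >= 2)."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    labels = rng.integers(0, K, size=n)  # random initial label vector
    cost = residual_nuclear_norm(M, labels, K)
    best_labels, best_cost = labels.copy(), cost
    t = t0
    for _ in range(n_iter):
        # Propose moving one random sample to a different cluster
        i = rng.integers(n)
        old = labels[i]
        labels[i] = (old + rng.integers(1, K)) % K
        new_cost = residual_nuclear_norm(M, labels, K)
        # Metropolis rule: always accept improvements, sometimes accept worse
        if new_cost <= cost or rng.random() < np.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best_cost, best_labels = cost, labels.copy()
        else:
            labels[i] = old  # reject the proposal
        t *= cooling  # geometric cooling (assumed schedule)
    return best_labels, best_cost
```

Calling `nnc_sketch(M, K)` returns the best label vector found and its objective value. Recomputing the full singular value decomposition at every proposal is expensive, so this form is only practical for small matrices.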
