Measuring group-separability in geometrical space for evaluation of pattern recognition and embedding algorithms

Evaluating data separation in a geometrical space is fundamental for pattern recognition. A plethora of dimensionality reduction (DR) algorithms have been developed to reveal geometrical patterns in a low-dimensional visible representation space, in which the similarities between high-dimensional samples are approximated by geometrical distances. However, statistical measures that evaluate, directly in the low-dimensional geometrical space, the sample group separability attained by these DR algorithms are missing. Such separability measures could be used both to compare the performance of algorithms and to tune their parameters. Here, we propose three statistical measures (named PSI-ROC, PSI-PR, and PSI-P) that originate from the Projection Separability (PS) rationale introduced in this study, which is expressly designed to assess the group separability of data samples in a geometrical space. Traditional cluster validity indices (CVIs) might be applied in this context, but they show limitations because they are not specifically tailored for DR. Our PS measures are compared to six baseline CVIs, using five non-linear datasets and six different DR algorithms. The results provide clear evidence that statistical measures based on the PS rationale are more accurate than CVIs and can be adopted to control the tuning of parameter-dependent DR algorithms.
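
The abstract does not spell out how the PS measures are computed, so the following is only an illustrative sketch of what evaluating group separability directly in the embedded space could look like. It assumes, purely for illustration, that two labelled groups are projected onto the line joining their centroids and that separability is then scored with the area under the ROC curve of the projected positions. The function name `projection_auc` and this two-group construction are hypothetical and should not be read as the definition of PSI-ROC.

```python
import numpy as np


def projection_auc(points, labels):
    """Hypothetical two-group separability score (illustration only).

    Assumption: samples are projected onto the line joining the two group
    centroids, and separability is the AUC-ROC of the projected positions.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    if groups.size != 2:
        raise ValueError("this sketch handles exactly two groups")

    a = points[labels == groups[0]]
    b = points[labels == groups[1]]

    # Direction of the line joining the two centroids.
    direction = b.mean(axis=0) - a.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        # Coincident centroids: no direction separates the groups.
        return 0.5
    direction = direction / norm

    # One-dimensional positions of every sample along that line.
    pa = a @ direction
    pb = b @ direction

    # AUC-ROC as the probability that a sample of group b lies further
    # along the line than a sample of group a (ties count one half).
    greater = (pb[:, None] > pa[None, :]).sum()
    ties = (pb[:, None] == pa[None, :]).sum()
    return (greater + 0.5 * ties) / (pa.size * pb.size)


# Toy two-dimensional embedding with two clearly separated groups.
embedding = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
groups = ["a", "a", "a", "b", "b", "b"]
score = projection_auc(embedding, groups)
```

Under these assumptions, a score near 1 indicates that the two groups do not overlap along the centroid line, while a score near 0.5 indicates no separation. Such a score could be computed for each candidate parameter setting of a DR algorithm and used to compare them.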
