Clustering using PK-D: A connectivity and density dissimilarity

New dissimilarity joining connectivity and density information.
Clustering using a low dimensional vector space representation based on the new dissimilarity.
Interesting clustering applications using gene expression and image data.
Improved clustering quality of simple algorithms like k-means.

We present a new dissimilarity that combines connectivity and density information. Usually, connectivity and density are conceived as mutually exclusive concepts; however, we discuss a novel procedure to merge both information sources. Once we have calculated the new dissimilarity, we apply multidimensional scaling (MDS) to find a low dimensional vector space representation. The new data representation can be used for clustering and for data visualization; the latter is not pursued in this paper. Instead, we use clustering to estimate the gain from our approach, which consists of dissimilarity + MDS. To that end, we analyze the quality of the partitions obtained by clustering high dimensional data with various well known clustering algorithms based on density, connectivity and message passing, as well as with simple algorithms like k-means and Hierarchical Clustering (HC). The quality gap between the partitions found by k-means and HC alone and those found by k-means and HC on our new low dimensional vector space representation is remarkable. Moreover, our tests on high dimensional gene expression and image data confirm these results and show a steady performance, which surpasses spectral clustering and other algorithms relevant to our work.
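The dissimilarity + MDS + clustering pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define PK-D, so `pairwise_euclidean` is a hypothetical stand-in for the dissimilarity, classical (Torgerson) MDS is assumed as the MDS variant, and the k-means routine uses a deterministic farthest-point initialization chosen only to keep the example reproducible. All function names are illustrative.

```python
import numpy as np


def pairwise_euclidean(X):
    """Stand-in dissimilarity matrix (assumption: PK-D is not defined here)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, n_components=2):
    """Embed a symmetric dissimilarity matrix D with classical (Torgerson) MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)     # drop negative eigenvalues
    return eigvecs[:, order] * np.sqrt(lam)


def kmeans(Y, k, n_iter=100):
    """Plain Lloyd's k-means with farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_from_dissimilarity(D, k, n_components=2):
    """Dissimilarity -> low dimensional embedding -> k-means labels."""
    Y = classical_mds(D, n_components)
    return kmeans(Y, k), Y
```

To try the sketch, build a dissimilarity matrix `D` for the data (here via the stand-in `pairwise_euclidean(X)`) and call `cluster_from_dissimilarity(D, k=2)`. Swapping in a different dissimilarity only requires replacing the function that produces `D`, since the embedding and clustering steps operate on the matrix alone.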
