Clustering by reordering of similarity and Laplacian matrices: Application to galaxy clusters

Abstract. Similarity metrics, kernels and similarity-based algorithms have gained much attention due to their increasing applications in information retrieval, data mining, pattern recognition and machine learning. Similarity graphs are often adopted as the underlying representation of similarity matrices and are at the origin of well-known clustering algorithms such as spectral clustering. Similarity matrices offer the advantage of working in object–object (two-dimensional) space, where similarities between clusters can be visualized directly, instead of object–feature (multi-dimensional) space. In this paper, sparse ϵ-similarity graphs are constructed and decomposed into strong components using appropriate methods such as the Dulmage–Mendelsohn permutation (DMperm) and/or the Reverse Cuthill–McKee (RCM) algorithm. The obtained strong components correspond to groups (clusters) in the input (feature) space. The parameter ϵ_i is estimated locally, at each data point i, from a corresponding narrow range of the number of nearest neighbors. Although more advanced clustering techniques are available, our method has the advantages of simplicity, lower computational complexity and direct visualization of cluster similarities in a two-dimensional space. Also, no prior information about the number of clusters is needed. We conducted our experiments on two- and three-dimensional synthetic datasets of small and large size, as well as on a real astronomical dataset. The results are verified graphically and analyzed using gap statistics over a range of neighbors to assess the robustness of the algorithm and the stability of the results. Combining the proposed algorithm with gap statistics provides a promising tool for solving clustering problems. As an astronomical application, we confirm the existence of 45 galaxy clusters around the X-ray positions of galaxy clusters in the redshift range [0.1, 0.8]. We re-estimate the photometric redshifts of the identified galaxy clusters and obtain values in acceptable agreement with published spectroscopic redshifts: the differences have a standard deviation of 0.029.
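
The pipeline described in the abstract (a sparse graph with a locally estimated ϵ_i, decomposition into components, and reordering of the similarity matrix) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: ϵ_i is taken to be the distance from point i to its k-th nearest neighbor, the graph is symmetrized, and SciPy's `connected_components` stands in for DMperm, which SciPy does not appear to provide. The function names `epsilon_graph` and `cluster_by_reordering` and the default `k=3` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, reverse_cuthill_mckee
from scipy.spatial import cKDTree


def epsilon_graph(points, k=3):
    """Build a sparse, symmetric similarity graph with a local radius eps_i.

    Assumption: eps_i is the distance from point i to its k-th nearest
    neighbor, so point i is linked to its k nearest neighbors.
    """
    tree = cKDTree(points)
    # Query k + 1 neighbors because the nearest "neighbor" is the point itself.
    dist, idx = tree.query(points, k=k + 1)
    eps = dist[:, -1]
    n = len(points)
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    adjacency = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))
    # Symmetrize (assumption): keep an edge if either endpoint selects it.
    adjacency = adjacency.maximum(adjacency.T).tocsr()
    return adjacency, eps


def cluster_by_reordering(points, k=3):
    """Label graph components and compute an RCM ordering of the matrix."""
    adjacency, _ = epsilon_graph(points, k)
    # Stand-in for the DMperm decomposition: components of the undirected graph.
    n_clusters, labels = connected_components(adjacency, directed=False)
    # RCM permutation; applying it to rows and columns groups each component
    # into a contiguous diagonal block of the similarity matrix.
    perm = reverse_cuthill_mckee(adjacency, symmetric_mode=True)
    return n_clusters, labels, perm
```

Indexing the adjacency matrix as `adjacency[perm][:, perm]` yields the reordered matrix, whose diagonal blocks can be plotted to inspect the clusters in object–object space. The number of clusters is not supplied; it follows from the graph, although it does depend on the chosen `k`.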
