Finding and visualizing relevant subspaces for clustering high-dimensional astronomical data using connected morphological operators

Data sets in astronomy are growing to enormous sizes. Modern astronomical surveys provide not only image data but also catalogues of millions of objects (stars, galaxies), each with hundreds of associated parameters. Exploring this very high-dimensional data space poses a huge challenge. Subspace clustering is one of several approaches that have been proposed for this purpose in recent years. However, many clustering algorithms require the user to set a large number of parameters without any guidelines. Some methods also fail to provide a concise summary of the datasets or, if they do, lack important additional information such as the number of clusters present or the significance of the clusters. In this paper, we propose a method for ranking subspaces for clustering which overcomes many of these limitations. First, we carry out a transformation from parametric space to discrete image space, where the data are represented by a grid-based density field. Then we apply so-called connected morphological operators to this density field of astronomical objects, which provides visual support for the analysis of the important subspaces. Clusters in subspaces correspond to high-intensity regions in the density image. The importance of a cluster is measured by a new quality criterion based on the dynamics of local maxima of the density. Connected operators are able to extract such regions and give an indication of the number of clusters present. The subspaces are visualized during computation of the quality measure, so that the user can interact with the system to improve the results. In the result stage, we use three visualization toolkits linked within a graphical user interface so that the user can perform an in-depth exploration of the ranked subspaces. Evaluation on synthetic as well as real astronomical datasets demonstrates the power of the new method. We recover various known astronomical relations directly from the data with few or no a priori assumptions. Hence, our method holds good prospects for discovering new relations as well.
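
The two computational ingredients named above, a grid-based density field and the dynamics of its local maxima, can be illustrated with a short sketch. This is not the authors' implementation: the function names `density_grid` and `maxima_dynamics`, the use of a plain 2-D histogram as the density estimate, the 8-connectivity, and the convention of assigning the global maximum a dynamic equal to its height above the image minimum are all assumptions made for illustration. The dynamic of a maximum is taken here as its height minus the highest level at which its region merges with a region containing a higher maximum.

```python
import numpy as np


def density_grid(x, y, bins=32):
    """Bin two parameters into a 2-D histogram (a simple grid-based density).

    Assumption: a plain histogram stands in for the density estimate.
    """
    hist, _, _ = np.histogram2d(x, y, bins=bins)
    return hist


def maxima_dynamics(img):
    """Return {(row, col): dynamic} for the maxima of a 2-D array.

    Pixels are visited from highest to lowest and merged with already
    visited 8-neighbours using union-find. When two components meet, the
    one with the lower peak ends, and its dynamic is that peak's height
    minus the current level.
    """
    rows, cols = img.shape
    flat = img.ravel().astype(float)
    order = np.argsort(-flat, kind="stable")  # highest pixels first
    parent = [-1] * flat.size                 # -1 marks "not visited yet"
    peak = {}                                 # component root -> index of its peak
    dynamics = {}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for idx in order:
        idx = int(idx)
        parent[idx] = idx
        peak[idx] = idx
        r, c = divmod(idx, cols)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr == 0 and dc == 0) or not (0 <= rr < rows and 0 <= cc < cols):
                    continue
                n = rr * cols + cc
                if parent[n] == -1:
                    continue  # neighbour is lower and not visited yet
                a, b = find(idx), find(n)
                if a == b:
                    continue
                # Let `a` be the component with the higher peak; `b` ends here.
                if flat[peak[a]] < flat[peak[b]]:
                    a, b = b, a
                d = flat[peak[b]] - flat[idx]
                if d > 0:
                    dynamics[divmod(peak[b], cols)] = float(d)
                parent[b] = a

    # Convention (assumption): the global maximum gets its height above the minimum.
    top = peak[find(int(order[0]))]
    dynamics[divmod(top, cols)] = float(flat[top] - flat.min())
    return dynamics
```

On a tiny image with two peaks of height 5 and 3 separated by a saddle of height 1 on a zero background, the smaller peak receives a dynamic of 2 and the larger one a dynamic of 5. In a subspace-ranking setting, one could apply `maxima_dynamics` to `density_grid(x, y)` for each pair of catalogue parameters and compare the resulting dynamics, although how those values are combined into the quality criterion is specific to the paper and is not reproduced here.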
