ClusterSculptor: A Visual Analytics Tool for High-Dimensional Data

Cluster analysis (CA) is a powerful strategy for exploring high-dimensional data in the absence of a priori hypotheses or data classification models, and the results of CA can then be used to form such models. But even though formal models and classification rules may not exist in these data exploration scenarios, domain scientists and experts generally have a vast amount of non-compiled knowledge and intuition that they can bring to bear in this effort. In CA, there are various popular mechanisms to generate the clusters; however, the results from their unsupervised deployment rarely agree fully with this expert knowledge and intuition. To this end, our paper describes a comprehensive and intuitive framework to aid scientists in deriving classification hierarchies in CA. It uses k-means as the overall clustering engine but allows scientists to tune its parameters interactively, based on a non-distorted, compact visual presentation of the inherent characteristics of the data in high-dimensional space. These characteristics include cluster geometry, composition, spatial relations to neighbors, and others. In essence, we provide all the tools necessary for a high-dimensional activity we call cluster sculpting, and the evolving hierarchy can then be viewed in a space-efficient radial dendrogram. We demonstrate our system in the context of the mining and classification of a large collection of millions of aerosol mass spectra, but our framework readily applies to any high-dimensional CA scenario.
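
The idea of growing a classification hierarchy by repeatedly applying k-means with a user-chosen k can be illustrated with a minimal sketch. This is not the system's implementation, which the abstract does not specify. The functions `kmeans` and `split_node`, the dictionary-based node layout, and the synthetic 4-D data are all assumptions made for illustration, and the interactive visual step is replaced by hard-coded choices of k.

```python
import random
from math import dist

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means on tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def split_node(node, k, seed=0):
    """Split one hierarchy node into k children (hypothetical node layout)."""
    _, labels = kmeans(node["points"], k, seed=seed)
    node["children"] = [
        {"points": [p for p, l in zip(node["points"], labels) if l == c], "children": []}
        for c in range(k)
    ]
    return node["children"]

# Synthetic stand-in data: three Gaussian blobs in 4-D (assumed, not from the paper).
rng = random.Random(1)
data = [tuple(rng.gauss(center, 0.3) for _ in range(4))
        for center in (0.0, 5.0, 10.0) for _ in range(30)]

root = {"points": data, "children": []}
children = split_node(root, k=2)               # the "user" picks k=2 at the root
largest = max(children, key=lambda n: len(n["points"]))
grandchildren = split_node(largest, k=2)       # then refines the largest child
```

In an interactive setting, the choice of which node to split and the value of k at each step would come from the analyst inspecting the visual presentation rather than from fixed arguments as above.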
