Visualizing Clustering Results

Non-hierarchical clustering has a long history in numerical taxonomy [13] and machine learning [1], with many applications in fields such as data mining [2], statistical analysis [3] and information retrieval [17]. Clustering involves finding a specific number of subgroups (k) within a set of s observations (data points or objects), each described by d attributes. A clustering algorithm generates cluster descriptions and assigns each observation either to one cluster (exclusive assignment) or in part to many clusters (partial assignment). Throughout this paper, we refer to the output of a clustering algorithm as the clustering results, solution, or model.

The information in a clustering solution is extensive: a mixture model or K-Means model produces k × s conditional probabilities or distances, one for each pairing of a cluster with an observation. Visualizing the clustering results helps the viewer assimilate this information quickly and provides insights that support and complement textual descriptions or statistical summaries. For example, we wish to know quickly how well defined the clusters are, how different they are from each other, how large they are, and whether the observations belong strongly to their cluster or only marginally.

Visualizing a clustering solution has many potential uses. During the highly iterative model-building process, the analyst can quickly obtain insights from the visualization that indicate the adequacy of the solution and suggest what further experiments to conduct. Alternatively, the business user can examine and query the final clustering solution using the visualization. The interesting parts of a clustering solution depend on the application. Database segmentation applications such as target marketing focus on the clusters and investigate which clusters are similar, which are autonomous and which have, for example, a high propensity to cross-sell. Anomaly detection applications attempt to identify those observations that do not “belong”, are interesting and require further investigation.
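
To make the size of a clustering solution concrete, the following minimal Python sketch builds the k × s distance table for a K-Means-style model and derives both an exclusive and a partial assignment from it. It is an illustration only, not the framework described in this paper: the function names, the toy data, and the exp(-d²/scale) weighting used for partial membership are all assumptions introduced here.

```python
import math

def distance_matrix(centroids, observations):
    """Return a k x s table: entry [j][i] is the Euclidean distance
    from cluster centre j to observation i."""
    return [[math.dist(c, x) for x in observations] for c in centroids]

def exclusive_assignment(distances):
    """Assign each observation to its single nearest cluster
    (ties go to the lowest cluster index)."""
    k, s = len(distances), len(distances[0])
    return [min(range(k), key=lambda j: distances[j][i]) for i in range(s)]

def partial_assignment(distances, scale=1.0):
    """Return, for each observation, k membership weights that sum to 1.
    The exp(-d^2 / scale) weighting is an illustrative assumption."""
    k, s = len(distances), len(distances[0])
    memberships = []
    for i in range(s):
        weights = [math.exp(-distances[j][i] ** 2 / scale) for j in range(k)]
        total = sum(weights)
        memberships.append([w / total for w in weights])
    return memberships

# Toy data (hypothetical): k = 2 cluster centres, s = 3 observations, d = 2.
centroids = [(0.0, 0.0), (4.0, 0.0)]
observations = [(0.0, 1.0), (4.0, -1.0), (2.0, 0.0)]

distances = distance_matrix(centroids, observations)   # k x s values
labels = exclusive_assignment(distances)               # one label per observation
memberships = partial_assignment(distances)            # s rows of k weights
```

In this toy example the third observation lies midway between the two centres, so its membership is split evenly between them; it belongs only marginally to whichever cluster the exclusive assignment picks, whereas the first two observations belong strongly to their nearest cluster.
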
In anomaly detection, the focus is on the observations, and we wish to know whether they belong strongly or only marginally to their most likely cluster. Typical uses of anomaly detection are detecting money laundering, identifying network intrusion, and data cleaning [5].

In this paper, we describe a general particle framework to display the information in a clustering solution. Changes to the parameters of the framework can emphasize

[1] Catherine Blake, et al. UCI Repository of machine learning databases, 1998.

[2] Emile H. L. Aarts, et al. Simulated annealing and Boltzmann machines - a stochastic approach to combinatorial optimization and neural computing, 1990, Wiley-Interscience series in discrete mathematics and optimization.

[3] J. Farris, et al. An Introduction to Numerical Classification, 1976.

[4] Eser Kandogan, et al. Visualizing multi-dimensional clusters, trends, and outliers using star coordinates, 2001, KDD '01.

[5] Graham J. Wills, et al. An interactive view for hierarchical clustering, 1998, Proceedings IEEE Symposium on Information Visualization (Cat. No.98TB100258).

[6] James Allan, et al. Lighthouse: showing the way to relevant information, 2000, IEEE Symposium on Information Visualization 2000. INFOVIS 2000. Proceedings.

[7] Michael J. A. Berry, et al. Data mining techniques - for marketing, sales, and customer support, 1997, Wiley computer publishing.

[8] Its'hak Dinstein, et al. On pattern classification with Sammon's nonlinear mapping an experimental study, 1998, Pattern Recognit.

[9] Matthew O. Ward, et al. Animating multidimensional scaling to visualize N-dimensional data sets, 1996, Proceedings IEEE Symposium on Information Visualization '96.

[10] R. Michalski, et al. Learning from Observation: Conceptual Clustering, 1983.

[11] Min Song. BiblioMapper: a cluster-based information visualization technique, 1998, Proceedings IEEE Symposium on Information Visualization (Cat. No.98TB100258).

[12] R. Cox, et al. Journal of the Royal Statistical Society B, 1972.

[13] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[14] Edward M. Reingold, et al. Graph drawing by force-directed placement, 1991, Softw. Pract. Exp.

[15] Adrian E. Raftery, et al. Linear flaw detection in woven textiles using model-based clustering, 1997, Pattern Recognit. Lett.

[16] Markus H. Gross, et al. H-BLOB: a hierarchical visual clustering method using implicit surfaces, 2000, Proceedings Visualization 2000. VIS 2000 (Cat. No.00CH37145).

[17] Matthew O. Ward, et al. Hierarchical parallel coordinates for exploration of large datasets, 1999, Proceedings Visualization '99 (Cat. No.99CB37067).