iVIBRATE: Interactive visualization-based framework for clustering large datasets

With continued advances in communication network and sensing technology, the amount of data produced and made available through cyberspace has grown dramatically. Efficient, high-quality clustering of large datasets remains one of the most important problems in large-scale data analysis. A commonly used methodology for cluster analysis on large datasets is the three-phase framework of sampling/summarization, iterative cluster analysis, and disk-labeling. Three known problems with this framework demand effective solutions. The first problem is how to effectively define and validate irregularly shaped clusters, especially in large datasets; automated algorithms and statistical methods are typically not effective in handling such clusters. The second problem is how to label the entire dataset on disk (disk-labeling) without introducing additional errors, which requires solutions for dealing with outliers, irregular clusters, and cluster boundary extension. The third problem is the lack of research on how to integrate the three phases effectively. In this article, we describe iVIBRATE, an interactive visualization-based three-phase framework for clustering large datasets. The two main components of iVIBRATE are its VISTA visual cluster-rendering subsystem, which brings human interaction into the large-scale iterative clustering process through interactive visualization, and its adaptive ClusterMap labeling subsystem, which offers visualization-guided disk-labeling solutions that are effective in dealing with outliers, irregular clusters, and cluster boundary extension.
Another important contribution of the iVIBRATE development is the identification of the special issues that arise in integrating the two components and the sampling approach into a coherent framework, together with solutions for improving the reliability of the framework and for minimizing the number of errors generated within the cluster analysis process. We study the effectiveness of the iVIBRATE framework through a walkthrough example on a dataset of one million records, and we experimentally evaluate the iVIBRATE approach using both real-life and synthetic datasets. Our results show that iVIBRATE can efficiently involve the user in the clustering process and generate high-quality clustering results for large datasets.
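To make the three-phase structure concrete, the following is a minimal sketch of a generic sampling / cluster analysis / disk-labeling pipeline. It is not the iVIBRATE implementation: plain k-means stands in for the interactive VISTA clustering step, and a nearest-centroid rule with a distance threshold stands in for ClusterMap labeling. All function names and the `outlier_radius` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def reservoir_sample(stream, k, rng):
    """Phase 1 (sampling): uniform sample of k records from a stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(point, centroids[c]))


def kmeans(points, k, rng, iters=20):
    """Phase 2 stand-in (assumption): plain k-means on the sample.
    The paper uses interactive visual clustering here instead."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return centroids


def label_stream(stream, centroids, outlier_radius):
    """Phase 3 stand-in (assumption): one pass over the full data,
    assigning each record to its nearest centroid, or -1 (outlier)
    if it lies farther than outlier_radius from every centroid."""
    for p in stream:
        idx = nearest(p, centroids)
        yield idx if math.dist(p, centroids[idx]) <= outlier_radius else -1
```

A centroid-plus-radius rule like this only suits roughly spherical clusters; handling irregular clusters and boundary extension during labeling is exactly the gap the abstract says ClusterMap is meant to address.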
