iVisClustering: An Interactive Visual Document Clustering via Topic Modeling

Clustering plays an important role in many large-scale data analyses, providing users with an overall understanding of their data. Nonetheless, clustering is not an easy task because of noisy features and outliers in the data, and the results obtained from automatic algorithms often do not make clear sense. To remedy this problem, automatic clustering should be complemented with interactive visualization strategies. This paper proposes an interactive visual analytics system for document clustering, called iVisClustering, based on a widely used topic modeling method, latent Dirichlet allocation (LDA). iVisClustering summarizes each cluster by its most representative keywords and visualizes soft clustering results in parallel coordinates. The main view of the system provides a 2D plot that visualizes cluster similarities and the relations among data items with a graph-based representation. iVisClustering provides several other views, which contain useful interaction methods. With the help of these visualization modules, users can interactively refine the clustering results in various ways. Keywords can be adjusted so that they characterize each cluster better. In addition, the system can filter out noisy data and re-cluster the data accordingly. A cluster hierarchy can be constructed using a tree structure; for this purpose, the system supports cluster-level interactions such as sub-clustering, removing unimportant clusters, merging clusters that have similar meanings, and moving clusters to any other node in the tree structure. Furthermore, the system provides document-level interactions such as moving mis-clustered documents to another cluster and removing useless documents. Finally, we demonstrate how interactive clustering is performed with iVisClustering on real-world document data sets.
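To make the soft-clustering vocabulary above concrete, the following is a minimal Python sketch, not the system's implementation. It assumes a topic model such as LDA has already produced a topic-word weight matrix and a document-topic proportion matrix. The function names (`top_keywords`, `hard_assign`, `merge_clusters`) and the toy numbers are hypothetical and chosen only for illustration. The merge step sums the proportions of two topics, which is one simple way to model merging clusters with similar meanings; the abstract does not say how iVisClustering performs the merge.

```python
from typing import List


def top_keywords(topic_word: List[List[float]], vocab: List[str], k: int = 2) -> List[List[str]]:
    """Summarize each topic (cluster) by its k highest-weight words."""
    summaries = []
    for weights in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        summaries.append([vocab[i] for i in ranked[:k]])
    return summaries


def hard_assign(doc_topic: List[List[float]]) -> List[int]:
    """Turn soft memberships into one cluster label per document (argmax)."""
    return [max(range(len(row)), key=lambda t: row[t]) for row in doc_topic]


def merge_clusters(doc_topic: List[List[float]], keep: int, drop: int) -> List[List[float]]:
    """Merge cluster `drop` into cluster `keep` by summing their proportions.

    Assumption: summing proportions is an illustrative merge rule only.
    """
    merged = []
    for row in doc_topic:
        new_row = [p for t, p in enumerate(row) if t != drop]
        # Removing column `drop` shifts later columns left by one.
        keep_idx = keep if keep < drop else keep - 1
        new_row[keep_idx] = row[keep] + row[drop]
        merged.append(new_row)
    return merged


# Toy, hand-written model output (hypothetical values, not from the paper).
vocab = ["graph", "node", "gene", "protein", "cluster"]
topic_word = [
    [0.50, 0.30, 0.00, 0.00, 0.20],  # topic 0: graph-related
    [0.00, 0.00, 0.50, 0.40, 0.10],  # topic 1: biology-related
    [0.40, 0.30, 0.05, 0.05, 0.20],  # topic 2: similar to topic 0
]
doc_topic = [
    [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1],
    [0.3, 0.1, 0.6],
]

keywords = top_keywords(topic_word, vocab, k=2)
labels_before = hard_assign(doc_topic)
merged = merge_clusters(doc_topic, keep=0, drop=2)
labels_after = hard_assign(merged)
```

In this toy data, topics 0 and 2 share the same top keywords, which is the kind of situation where a user might choose to merge the two clusters. After the merge, the documents previously split between them receive the same label.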
