Clustrophile 2: Guided Visual Clustering Analysis

Data clustering is a common unsupervised learning method frequently used in exploratory data analysis. However, identifying relevant structures in unlabeled, high-dimensional data is nontrivial, requiring iterative experimentation with clustering parameters as well as data features and instances. The number of possible clusterings for a typical dataset is vast, and navigating this space is itself challenging. The absence of ground-truth labels makes it impossible to define an optimal solution, so user judgment is required to establish what counts as a satisfactory clustering result. Data scientists need adequate interactive tools to explore and navigate the large clustering space effectively and thereby improve exploratory clustering analysis. We introduce Clustrophile 2, a new interactive tool for guided clustering analysis. Clustrophile 2 guides users in clustering-based exploratory analysis, adapts to user feedback to improve its guidance, facilitates the interpretation of clusters, and helps users quickly reason about differences between clusterings. To this end, Clustrophile 2 contributes a novel feature, the Clustering Tour, which helps users choose clustering parameters and assess the quality of different clustering results in relation to current analysis goals and user expectations. We evaluate Clustrophile 2 through a user study with 12 data scientists, who used our tool to explore and interpret sub-cohorts in a dataset of Parkinson's disease patients. Results suggest that Clustrophile 2 improves the speed and effectiveness of exploratory clustering analysis for both experts and non-experts.
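
To give a rough sense of how vast the clustering space mentioned above can be, the sketch below counts the ways a set of n labeled data points can be split into non-empty groups (the Bell numbers). This is an illustration only, not part of Clustrophile 2: the function name `bell_number` is hypothetical, and the count covers hard partitions only, ignoring choices of algorithm, distance metric, or feature subset, all of which enlarge the space further.

```python
def bell_number(n: int) -> int:
    """Count the partitions of n labeled items into non-empty clusters.

    Uses the Bell triangle: each new row starts with the last entry of the
    previous row, and every further entry is the sum of its left neighbor
    and the entry above that neighbor. The first entry of row n is B(n).
    """
    row = [1]  # row 0 of the Bell triangle, so B(0) = 1
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    # Print the count for a few small dataset sizes to show the growth rate.
    for size in (3, 5, 10, 20):
        print(size, bell_number(size))
```

Even for a toy dataset of 10 points there are already 115,975 distinct partitions, which is why exhaustive inspection is not an option and guided exploration is needed.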
