Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning

Abstract. Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented separately for unsupervised, semi-supervised, and supervised machine learning tasks, combining all three methodologies remains challenging. In this paper, a visual analytics approach is presented which combines a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the resulting classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, the analyst's interactions with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling, through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier.
The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.
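
To make the active learning step of the workflow more concrete, the following is a minimal sketch of pool-based uncertainty sampling, in which the system proposes the unlabelled record it is least sure about so that the analyst can label it next. The abstract does not state which query strategy or classifier mVis uses, so everything here is an assumption for illustration: the nearest-centroid classifier, the margin-based uncertainty measure, the function names, the toy records, and the partition names `partition_a` and `partition_b` are all hypothetical.

```python
import math


def centroid(points):
    """Return the component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def class_centroids(labelled):
    """Group labelled records by partition name and compute one centroid per partition.

    `labelled` maps a record (tuple of floats) to its partition name.
    """
    groups = {}
    for record, label in labelled.items():
        groups.setdefault(label, []).append(record)
    return {label: centroid(records) for label, records in groups.items()}


def margin(record, centroids):
    """Gap between the distances to the two nearest centroids.

    A small margin means the record lies almost equally close to two
    partitions, so a nearest-centroid classifier would be uncertain about it.
    Requires at least two labelled partitions.
    """
    distances = sorted(math.dist(record, c) for c in centroids.values())
    return distances[1] - distances[0]


def most_uncertain(unlabelled, labelled):
    """Pick the unlabelled record with the smallest margin as the next query."""
    centroids = class_centroids(labelled)
    return min(unlabelled, key=lambda r: margin(r, centroids))


# Hypothetical toy data: two labelled records and three unlabelled ones.
labelled = {(0.0, 0.0): "partition_a", (10.0, 10.0): "partition_b"}
unlabelled = [(1.0, 1.0), (5.0, 5.5), (9.0, 9.0)]

query = most_uncertain(unlabelled, labelled)
print(query)  # (5.0, 5.5), the record lying between the two partitions
```

In an interactive setting, the analyst would assign a partition to the proposed record, the record would move from the unlabelled pool to the labelled set, and the selection would be repeated. Restricting the distance computation to a subset of dimensions would be one simple way for dimension choices to influence such a query, although the abstract does not describe how automatic dimension selection is actually realised.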
