PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML Pipelines

In recent years, a wide variety of automated machine learning (AutoML) methods have been proposed to generate end-to-end machine learning (ML) pipelines. While these techniques facilitate the creation of models, they are difficult for developers to debug, given their black-box nature, the complexity of the underlying algorithms, and the large number of pipelines they derive. It is also challenging for machine learning experts to select an AutoML system that is well suited for a given problem. In this paper, we present PipelineProfiler, an interactive visualization tool that allows the exploration and comparison of the solution space of ML pipelines produced by AutoML systems. PipelineProfiler is integrated with Jupyter Notebook and can be combined with common data science tools to enable a rich set of analyses of the ML pipelines, giving users a better understanding of the algorithms that generated them as well as insights into how they can be improved. We demonstrate the utility of our tool through use cases where PipelineProfiler is used to better understand and improve a real-world AutoML system. Furthermore, we validate our approach by presenting a detailed analysis of a think-aloud experiment with six data scientists who develop and evaluate AutoML tools.
