Human-Machine Collaboration for Democratizing Data Science

Everybody wants to analyse their data, but only a few possess the data science expertise to do so. Motivated by this observation, we introduce a novel framework and system, \textsc{VisualSynth}, for human-machine collaboration in data science. It aims to democratize data science by allowing users to interact with standard spreadsheet software in order to perform and automate various data analysis tasks, including data wrangling, data selection, clustering, constraint learning, predictive modeling, and auto-completion. \textsc{VisualSynth} relies on the user providing colored sketches, i.e., coloring parts of the spreadsheet, to partially specify data science tasks, which are then determined and executed using artificial intelligence techniques.
