PRESISTANT: Learning based assistant for data pre-processing

Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or no impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
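The ranking idea described above can be illustrated with a minimal sketch. It assumes a dataset is summarized by a few meta-features and that some learned model predicts the change in accuracy each operator would cause. Everything here is hypothetical: the meta-feature names, the operator names, and `toy_predictor`, which is a hand-written stand-in for a trained model such as a Random Forest and is not PRESISTANT's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical meta-features describing a dataset (names are illustrative).
MetaFeatures = Dict[str, float]


def rank_operators(
    meta: MetaFeatures,
    operators: List[str],
    predict_impact: Callable[[MetaFeatures, str], float],
) -> List[Tuple[str, float]]:
    """Return (operator, predicted impact) pairs, highest impact first."""
    scored = [(op, predict_impact(meta, op)) for op in operators]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_predictor(meta: MetaFeatures, operator: str) -> float:
    """Stand-in for a learned model; the rules and numbers are invented."""
    if operator == "discretize":
        return 0.03 if meta["numeric_ratio"] > 0.5 else -0.01
    if operator == "replace_missing":
        return 0.02 if meta["missing_ratio"] > 0.0 else 0.0
    if operator == "normalize":
        return 0.01
    return 0.0  # unknown operators are assumed to have no effect


meta = {"numeric_ratio": 0.8, "missing_ratio": 0.1}
operators = ["normalize", "discretize", "replace_missing", "no_op"]
ranking = rank_operators(meta, operators, toy_predictor)

for name, impact in ranking:
    print(f"{name}: {impact:+.2f}")
```

In a real setting, the predictor would be trained on observed accuracy changes across many datasets, and a separate predictor could be kept per classification algorithm; the sketch only shows how predicted impacts turn into a ranked recommendation.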
