Paper information - PRESISTANT: Learning based assistant for data pre-processing

Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or no impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five different classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
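The ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the meta-feature names, the operator names, and `toy_predictor` are hypothetical, and the hand-written rules in `toy_predictor` merely stand in for a trained meta-model (Random Forests in the paper) that would predict how much an operator changes a classifier's accuracy.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical dataset description: meta-feature name -> value.
MetaFeatures = Dict[str, float]


def rank_operators(
    meta: MetaFeatures,
    operators: List[str],
    predict_impact: Callable[[MetaFeatures, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate operator and return them best first."""
    scored = [(op, predict_impact(meta, op)) for op in operators]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_predictor(meta: MetaFeatures, op: str) -> float:
    """Placeholder for a learned model; these rules are invented, not learned."""
    if op == "discretize":
        return 0.04 * meta["numeric_ratio"]
    if op == "normalize":
        return 0.01 * meta["numeric_ratio"]
    if op == "impute_missing":
        return 0.10 * meta["missing_ratio"]
    return 0.0  # unknown operators are assumed to have no impact


# Illustrative dataset: half the attributes numeric, 30% of values missing.
meta = {"numeric_ratio": 0.5, "missing_ratio": 0.3}
candidates = ["normalize", "discretize", "impute_missing", "no_op"]
ranking = rank_operators(meta, candidates, toy_predictor)

for op, impact in ranking:
    print(f"{op}: predicted accuracy change {impact:+.3f}")
```

In a real setting, `predict_impact` would wrap a model trained on past observations of (dataset meta-features, operator, measured accuracy change), with one such model per classification algorithm; the sorting step would stay the same.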
