P4ML: A Phased Performance-Based Pipeline Planner for Automated Machine Learning

While many problems could benefit from recent advances in machine learning, designing a customized solution for each problem requires significant time and expertise. Prior attempts to automate machine learning have focused on generating multi-step solutions composed of primitive steps for feature engineering and modeling, but they assume data that is already clean and featurized and rely on carefully curated primitives. Cleaning and featurization, however, are often the most time-consuming steps in a data science pipeline. We present a novel approach that works with naturally occurring data of any size and type, and with diverse third-party data processing and modeling primitives that can lead to better-quality solutions. The key idea is to generate multi-step pipelines (or workflows) by factoring the search for solutions into phases, each of which applies a different expert-like strategy designed to improve performance. This approach is implemented in the P4ML system, which demonstrates superior performance over other systems on a variety of raw datasets.
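The phased idea can be illustrated with a toy sketch. The code below is not the P4ML implementation: the primitives (`drop_missing`, `fill_mean`, `identity`, `square`, `fit_line`, `fit_mean`), the three phase names, the greedy one-phase-at-a-time strategy, and the held-out scoring are all hypothetical choices made for illustration. It assumes only that each phase searches over its own set of candidate primitives and keeps the candidate that scores best.

```python
from statistics import mean

# Toy raw data: (x, y) rows where x may be missing (None); y is roughly x**2.
RAW = [(1.0, 1.1), (2.0, 3.9), (None, 5.0), (3.0, 9.2), (4.0, 15.8),
       (5.0, 25.1), (None, 2.0), (6.0, 35.7), (7.0, 49.3), (8.0, 64.2)]

# Hypothetical cleaning primitives.
def drop_missing(rows):
    return [(x, y) for x, y in rows if x is not None]

def fill_mean(rows):
    m = mean(x for x, _ in rows if x is not None)
    return [(m if x is None else x, y) for x, y in rows]

# Hypothetical featurization primitives.
def identity(rows):
    return rows

def square(rows):
    return [(x * x, y) for x, y in rows]

# Hypothetical modeling primitives: each returns a predict(x) function.
def fit_line(train):
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    var = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / var if var else 0.0
    return lambda x: my + slope * (x - mx)

def fit_mean(train):
    m = mean(y for _, y in train)
    return lambda x: m

# Phases are searched in order; the first candidate of each is the default.
PHASES = [
    ("clean", [drop_missing, fill_mean]),
    ("featurize", [identity, square]),
    ("model", [fit_line, fit_mean]),
]

def evaluate(pipeline, rows):
    """Negative mean squared error on a held-out split (higher is better)."""
    data = pipeline["featurize"](pipeline["clean"](rows))
    train, test = data[::2], data[1::2]
    predict = pipeline["model"](train)
    return -mean((predict(x) - y) ** 2 for x, y in test)

def phased_search(rows):
    """Greedily fix one phase at a time, keeping the best-scoring candidate."""
    pipeline = {name: candidates[0] for name, candidates in PHASES}
    for name, candidates in PHASES:
        pipeline[name] = max(
            candidates,
            key=lambda c: evaluate({**pipeline, name: c}, rows),
        )
    return pipeline, evaluate(pipeline, rows)

if __name__ == "__main__":
    best, score = phased_search(RAW)
    print({name: fn.__name__ for name, fn in best.items()}, score)
```

Factoring the search this way evaluates a handful of candidates per phase instead of every combination of primitives, at the cost of depending on the defaults used for phases that have not yet been searched.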
