AutoML to Date and Beyond: Challenges and Opportunities

Automated machine learning (AutoML) is the process of automating the application of machine learning to real-world problems. The primary goals of AutoML tools are to make machine learning accessible to non-experts (domain experts), to improve the efficiency of machine learning workflows, and to accelerate machine learning research. Although automation and efficiency are among AutoML's main selling points, the process still requires a surprising level of human involvement. Several vital steps of the machine learning pipeline, including understanding the attributes of domain-specific data, defining prediction problems, and creating a suitable training data set, still tend to be performed manually by a data scientist on an ad-hoc basis. This often requires extensive back-and-forth between the data scientist and domain experts, making the whole process more difficult and inefficient. Altogether, AutoML systems are still far from being truly automatic. In this review article, we present a level-wise taxonomic perspective on AutoML systems to date and beyond; that is, we introduce a new classification system with seven levels to distinguish AutoML systems based on their degree of autonomy. We first discuss what an end-to-end machine learning pipeline actually looks like and which of its sub-tasks have indeed been automated so far. Next, we highlight the sub-tasks that are still mostly performed manually by a data scientist and explain how this limits domain experts' access to machine learning. We then introduce the novel level-based taxonomy of AutoML systems and define each level according to its scope of automation support. Finally, we provide a roadmap for future research in AutoML and discuss some important challenges in achieving this ambitious goal.
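To make this automation gap concrete, the minimal sketch below contrasts the pipeline stages a typical AutoML tool automates today with the ones a data scientist still performs by hand. The data file, column names, and prediction target are hypothetical; scikit-learn's RandomizedSearchCV is used purely as a stand-in for the model-selection and hyperparameter-search step that dedicated AutoML tools automate.

```python
# Sketch of today's division of labor in an "AutoML" workflow.
# The raw file, column names, and prediction target are hypothetical.
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# --- Still manual: problem formulation and training-set creation ---
# A data scientist decides, together with domain experts, what to
# predict and how to turn raw records into labeled examples.
raw = pd.read_csv("customer_events.csv")  # hypothetical raw event log
raw["churned"] = (raw["days_since_last_visit"] > 90).astype(int)  # hand-picked label definition
features = raw[["num_purchases", "avg_order_value", "account_age_days"]]  # hand-picked features
X_train, X_test, y_train, y_test = train_test_split(
    features, raw["churned"], test_size=0.2, random_state=0
)

# --- Automated today: model configuration and hyperparameter search ---
# Random search over hyperparameters, standing in for the search step
# that AutoML systems handle automatically.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 12),
    },
    n_iter=20,
    cv=3,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```

Everything above the search call, i.e., the choice of prediction target, label definition, and feature set, is exactly the kind of work the taxonomy introduced in this article uses to separate lower from higher levels of automation.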
