Elastic Machine Learning Algorithms in Amazon SageMaker

There is a large body of research on scalable machine learning (ML). Nevertheless, training ML models on large, continuously evolving datasets is still a difficult and costly undertaking for many companies and institutions. We discuss such challenges and derive requirements for an industrial-scale ML platform. Next, we describe the computational model behind Amazon SageMaker, which is designed to meet such challenges. SageMaker is an ML platform provided as part of Amazon Web Services (AWS), and supports incremental training, resumable and elastic learning as well as automatic hyperparameter optimization. We detail how to adapt several popular ML algorithms to its computational model. Finally, we present an experimental evaluation on large datasets, comparing SageMaker to several scalable, JVM-based implementations of ML algorithms, which we significantly outperform with regard to computation time and cost.

[1]  C. A. R. Hoare Algorithm 63: partition , 1961, CACM.

[2]  C. A. R. Hoare,et al.  Algorithm 65: find , 1961, Commun. ACM.

[3]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[4]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[5]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[6]  Rob J Hyndman,et al.  Automatic Time Series Forecasting: The forecast Package for R , 2008 .

[7]  D. Sculley,et al.  Web-scale k-means clustering , 2010, WWW '10.

[8]  Steffen Rendle,et al.  Factorization Machines , 2010, 2010 IEEE International Conference on Data Mining.

[9]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[10]  Adam Meyerson,et al.  Fast and Accurate k-means For Large Datasets , 2011, NIPS.

[11]  Joseph M. Hellerstein,et al.  Distributed GraphLab: A Framework for Machine Learning in the Cloud , 2012, Proc. VLDB Endow..

[12]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[13]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[14]  Steffen Rendle,et al.  Factorization Machines with libFM , 2012, TIST.

[15]  Martin Wattenberg,et al.  Ad click prediction: a view from the trenches , 2013, KDD.

[16]  Dan Feldman,et al.  Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering , 2013, SODA.

[17]  Alexander J. Smola,et al.  Communication Efficient Distributed Machine Learning with the Parameter Server , 2014, NIPS.

[18]  Christopher Ré,et al.  DimmWitted: A Study of Main-Memory Statistical Analytics , 2014, Proc. VLDB Endow..

[19]  Alexander J. Smola,et al.  Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.

[20]  Venu Satuluri,et al.  Factorbird - a Parameter Server Approach to Distributed Matrix Factorization , 2014, ArXiv.

[21]  Jimmy J. Lin,et al.  Summingbird: A Framework for Integrating Batch and Online MapReduce Computations , 2014, Proc. VLDB Endow..

[22]  D. Sculley,et al.  Hidden Technical Debt in Machine Learning Systems , 2015, NIPS.

[23]  Zheng Zhang,et al.  MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[24]  Maxim Sviridenko,et al.  An Algorithm for Online K-Means Clustering , 2014, ALENEX.

[25]  Jeffrey F. Naughton,et al.  Model Selection Management Systems: The Next Frontier of Advanced Analytics , 2016, SGMD.

[26]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[27]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[28]  Phil Blunsom,et al.  Neural Variational Inference for Text Processing , 2015, ICML.

[29]  Alexander J. Smola,et al.  DiFacto: Distributed Factorization Machines , 2016, WSDM.

[30]  Edo Liberty,et al.  Optimal Quantile Approximation in Streams , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[31]  Shirish Tatikonda,et al.  SystemML: Declarative Machine Learning on Spark , 2016, Proc. VLDB Endow..

[32]  Tilmann Rabl,et al.  Benchmarking Data Flow Systems for Scalable Machine Learning , 2017, BeyondMR@SIGMOD.

[33]  Valentin Flunkert,et al.  DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks , 2017, International Journal of Forecasting.

[34]  Joos-Hendrik Böse,et al.  Probabilistic Demand Forecasting at Scale , 2017, Proc. VLDB Endow..

[35]  Xin Zhang,et al.  TFX: A TensorFlow-Based Production-Scale Machine Learning Platform , 2017, KDD.

[36]  Neoklis Polyzotis,et al.  Data Lifecycle Challenges in Production Machine Learning , 2018, SIGMOD Rec..

[37]  Felix Bießmann,et al.  On Challenges in Machine Learning Model Management , 2018, IEEE Data Eng. Bull..

[38]  Syama Sundar Rangapuram,et al.  Deep Learning for Forecasting: Current Trends and Challenges , 2018 .

[39]  Syama Sundar Rangapuram,et al.  Deep Learning for Forecasting , 2018 .

[40]  Sebastian Schelter,et al.  Differential Data Quality Verification on Partitioned Data , 2019, 2019 IEEE 35th International Conference on Data Engineering (ICDE).

[41]  Stephan Kolassa,et al.  A Classification of Business Forecasting Problems , 2019 .

[42]  Stefanie N. Lindstaedt,et al.  SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle , 2019, CIDR.