Dolphin : Runtime Optimization for Distributed Machine Learning

Large-scale machine learning (ML) systems are becoming widely used. Typically, these ML systems run on fixed resources, but it is difficult to find their optimal configurations (e.g., how many nodes to use, how to distribute data) since they depend on multiple factors such as hardware environments, ML algorithms, input datasets, etc. Furthermore, optimal configurations often can change over time due to fluctuating cluster resources and changing ML algorithm patterns. In this paper, we present Dolphin, an elastic machine learning framework that addresses the configuration problem at runtime. Dolphin solves a cost-based optimization problem to find an optimal configuration and reconfigures the system dynamically at runtime. Dolphin introduces a new distributed memory abstraction to change resource and data configurations based on the optimizer plan transparently and efficiently.

[1]  Olatunji Ruwase,et al.  Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems , 2015, KDD.

[2]  Alexander J. Smola,et al.  Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.

[3]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[4]  Alexander J. Smola,et al.  An architecture for parallel topic models , 2010, Proc. VLDB Endow..

[5]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[6]  Carlos Guestrin,et al.  Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud , 2012 .

[7]  Tim Kraska,et al.  Automating model search for large scale machine learning , 2015, SoCC.

[8]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[9]  Yaoliang Yu,et al.  Petuum: A New Platform for Distributed Machine Learning on Big Data , 2015, IEEE Trans. Big Data.

[10]  Shirish Tatikonda,et al.  SystemML: Declarative machine learning on MapReduce , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[11]  Randy H. Katz,et al.  Wrangler: Predictable and Faster Jobs using Fewer Resources , 2014, SoCC.

[12]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[13]  Trishul M. Chilimbi,et al.  Project Adam: Building an Efficient and Scalable Deep Learning Training System , 2014, OSDI.

[14]  Shirish Tatikonda,et al.  Resource Elasticity for Large-Scale Machine Learning , 2015, SIGMOD Conference.

[15]  Seunghak Lee,et al.  Solving the Straggler Problem with Bounded Staleness , 2013, HotOS.

[16]  Carlo Curino,et al.  REEF: Retainable Evaluator Execution Framework , 2013, Proc. VLDB Endow..