Towards Resource-Elastic Machine Learning

The availability of powerful distributed data platforms and the widespread success of Machine Learning (ML) have led to a virtuous cycle wherein organizations are investing in gathering a wider range of (even bigger!) datasets and addressing an even broader range of tasks. The Hadoop Distributed File System (HDFS) is being provisioned to capture and durably store these datasets. Alongside HDFS, resource managers like Mesos [10], Corona [8], and YARN [16] enable the allocation of compute resources “near the data,” where frameworks like REEF [3] can cache data in memory and support fast iterative computations. Unfortunately, most ML algorithms are not tuned to operate on these new cloud platforms, where two new challenges arise: 1) scale-up: the need to acquire more resources dedicated to a particular algorithm, and 2) scale-down: the need to react to resource preemption. This paper focuses on the scale-down challenge, since it poses the most stringent requirements for executing on cloud platforms like YARN, which reserves the right to preempt compute resources dedicated to a job (tenant) [16].
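To make the scale-down challenge concrete, the sketch below shows one way (not the mechanism developed in this paper) that a YARN ApplicationMaster can observe preemption warnings: each heartbeat to the ResourceManager may carry a PreemptionMessage naming containers the scheduler intends to reclaim, giving the application a window in which to react (e.g., checkpoint partial model state) before the containers are killed. The surrounding loop structure and the reaction itself are illustrative placeholders; the YARN client calls are from the standard org.apache.hadoop.yarn API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Minimal sketch of a preemption-aware ApplicationMaster heartbeat loop.
public class PreemptionAwareMaster {

  public static void main(String[] args) throws Exception {
    AMRMClient<AMRMClient.ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new Configuration());
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    while (true) {
      // Each allocate() heartbeat may return a preemption warning from the RM.
      AllocateResponse response = rmClient.allocate(0.5f);
      PreemptionMessage preemption = response.getPreemptionMessage();
      if (preemption != null && preemption.getContract() != null) {
        for (PreemptionContainer c : preemption.getContract().getContainers()) {
          // Hypothetical reaction point: persist partial model state and
          // redistribute work before the RM forcibly reclaims the container.
          System.out.println("Container " + c.getId() + " scheduled for preemption");
        }
      }
      Thread.sleep(1000);
    }
  }
}
```

An ML algorithm that ignores these warnings simply loses the preempted workers mid-iteration; a resource-elastic one treats them as a signal to shed state gracefully, which is precisely the behavior this paper targets.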