Many systems have been developed for machine learning at scale. Performance has steadily improved, but there has been relatively little work on explicitly defining, or approaching, the limits of performance. In this paper we describe the application of roofline design, an approach borrowed from computer architecture, to large-scale machine learning. Roofline design exposes the ALU, memory, and network limits of a machine and the constraints they imply for algorithms. Using roofline design, we have developed a system called BIDMach that has demonstrated the highest performance to date on many ML problems. On a single GPU-accelerated node, it generally outperforms other single-machine toolkits as well as cluster toolkits running on hundreds of nodes. This performance is enabled by a relatively small number of rooflined matrix primitives, and it implies a dramatic reduction in the energy used to perform these calculations. Beyond matrix kernels, roofline design can be applied to the end-to-end design of machine learning algorithms that minimize memory usage to optimize speed; this approach offers a further 2x to 3x gain in performance. Roofline design can also be applied to network primitives. We describe recent work on a sparse allreduce primitive called Kylix, and we have shown that Kylix approaches the practical network throughput limit for allreduce, a basic primitive of distributed machine learning. Using Kylix, we describe an efficient transformation from model-parallel to data-parallel calculations; this transformation uses a secondary-storage roofline whose parameters are similar to those of the network. Finally, we describe several deployments of these techniques on real-world problems at two large internet companies. Once again, single-node roofline design demonstrated substantial gains over alternatives running on either single nodes or clusters.
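For reference, the bound that underlies this design style can be stated compactly. The notation below (peak throughput, operational intensity, and channel bandwidth) is standard roofline-model notation rather than terminology introduced in this paper, and the statement is a sketch of the general bound, not of BIDMach's specific kernels:

\[
P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B\bigr),
\]

where \(P_{\text{peak}}\) is the peak arithmetic throughput of the processor, \(I\) is the operational intensity of the kernel (arithmetic operations performed per byte moved over the limiting channel, e.g. main memory on a single node or the network in a cluster), and \(B\) is that channel's bandwidth. A kernel whose measured throughput sits near this bound cannot be made substantially faster on the same hardware without raising its operational intensity, which is the sense in which rooflined primitives approach the limits of performance.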