This paper describes recent work on the BIDMach toolkit for large-scale machine learning. BIDMach has demonstrated single-node performance that exceeds that of published cluster systems for many common machine-learning tasks. BIDMach makes full use of both CPU and GPU acceleration (through its sister library BIDMat) and requires only modest hardware (commodity GPUs). One of the challenges in reaching this level of performance is the allocation barrier: while it is simple and expedient to allocate and recycle matrix (or graph) objects in expressions, this approach is too slow to match the arithmetic throughput possible on either GPUs or CPUs. In this paper we describe a caching approach that allows code with complex matrix (graph) expressions to run at massive scale, i.e. on multi-terabyte data, with zero memory allocation after initial start-up. We present a number of new benchmarks that leverage this approach.
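The caching idea described above can be sketched in a few lines. The following is a hypothetical Python/NumPy analogue, not BIDMach's actual implementation (BIDMach is written in Scala, and BIDMat keys its cache on matrix GUIDs; the `MatCache` class and the use of Python `id()` as an operand key are illustrative stand-ins): each operator in an expression looks up a preallocated output matrix keyed by the operation and its operands, so in steady state a repeatedly evaluated expression performs no fresh allocations.

```python
import numpy as np

class MatCache:
    """Illustrative sketch of result-matrix caching.

    Each (operation, operands) pair maps to a preallocated output
    buffer that is reused on every evaluation, so a training loop
    that re-evaluates the same expressions allocates nothing after
    the first pass.
    """

    def __init__(self):
        self._cache = {}

    def result(self, op, a, b):
        # Key on the operation and the operand objects' identities and
        # shapes. BIDMat keys on matrix GUIDs; Python id() stands in here.
        key = (op, id(a), id(b), a.shape, b.shape)
        out = self._cache.get(key)
        if out is None:
            # First evaluation: allocate the output once and cache it.
            out = np.empty(np.broadcast_shapes(a.shape, b.shape))
            self._cache[key] = out
        return out

    def add(self, a, b):
        out = self.result("add", a, b)
        np.add(a, b, out=out)  # writes in place, no fresh allocation
        return out
```

On a second evaluation of `cache.add(a, b)` with the same operands, the same buffer object is returned and overwritten in place, which is the property that lets long-running jobs reach zero steady-state allocation.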