Paper Information - RABIT: A Reliable Allreduce and Broadcast Interface

Allreduce is an abstraction commonly used for solving machine learning problems. It is an operation in which every node starts with a local value and ends up with the same aggregate global result. MPI provides an Allreduce implementation. Though widely adopted, it is somewhat limited: it lacks fault tolerance and cannot easily run on existing systems. In this work, we propose RABIT, an Allreduce library for distributed machine learning algorithms that overcomes these drawbacks: it is fault-tolerant and can easily run on top of existing systems. We compare RABIT with existing solutions and show that it performs competitively.
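
The Allreduce semantics described above can be illustrated with a minimal single-process sketch. It only simulates the operation's input/output contract (each node contributes a local vector, and every node ends with the same aggregate); it is not RABIT's or MPI's API, it involves no networking or fault tolerance, and the names `allreduce` and `local_gradients` are hypothetical, chosen for illustration.

```python
from functools import reduce
from typing import Callable, List, Sequence


def allreduce(
    local_values: Sequence[List[float]],
    op: Callable[[float, float], float],
) -> List[List[float]]:
    """Simulate Allreduce over in-process "nodes" (hypothetical helper).

    local_values[i] is the vector held by node i before the call.
    The return value holds, for each node, the vector it ends up with.
    """
    # Reduce step: combine the nodes' vectors element-wise with `op`.
    aggregate = [reduce(op, column) for column in zip(*local_values)]
    # Broadcast step: every node receives its own copy of the aggregate.
    return [list(aggregate) for _ in local_values]


# Three simulated nodes, each holding a local gradient vector.
local_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
results = allreduce(local_gradients, lambda a, b: a + b)

for node_id, value in enumerate(results):
    print(f"node {node_id}: {value}")
```

After the call, all three simulated nodes hold the element-wise sum of the local vectors, which is the property distributed learning algorithms rely on when they aggregate statistics such as gradients across workers.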

Tianqi Chen | Ignacio Cano | Tianyi Zhou
