RABIT: A Reliable Allreduce and Broadcast Interface

Allreduce is an abstraction commonly used for solving machine learning problems. It is an operation where every node starts with a local value and ends up with an aggregate global result. MPI provides an Allreduce implementation. Though it has been widely adopted, it is somewhat limited; it lacks fault tolerance and cannot run easily on existing systems. In this work, we propose RABIT, an Allreduce library suitable for distributed machine learning algorithms that overcomes the aforementioned drawbacks; it is fault-tolerant and can easily run on top of existing systems. We compare RABIT with existing solutions and show that it performs competitively.
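
To make the Allreduce semantics concrete, the following minimal sketch simulates the operation inside a single Python process: each list entry stands for the local value of one node, and after the call every node holds the same aggregate. This is only an illustration of the abstraction, not the RABIT or MPI interface; the function name `simulate_allreduce` and its parameters are hypothetical, and no networking or fault tolerance is modeled.

```python
from functools import reduce
import operator


def simulate_allreduce(local_values, op=operator.add):
    """Simulate Allreduce semantics in one process (hypothetical helper).

    local_values[i] is the value that node i starts with.
    Returns a list where entry i is the value node i ends up with,
    which is the same global aggregate for every node.
    """
    if not local_values:
        raise ValueError("at least one node is required")
    # Combine all local values with the reduction operator.
    aggregate = reduce(op, local_values)
    # Every node receives the same aggregate result.
    return [aggregate for _ in local_values]


# Four simulated nodes, summing their local values.
sum_results = simulate_allreduce([1, 2, 3, 4])
# Three simulated nodes, taking the maximum instead of the sum.
max_results = simulate_allreduce([3, 9, 2], op=max)
```

A real implementation would exchange the values across machines, for example along a tree or a ring, but the contract seen by each node is the same: a local value goes in and the global aggregate comes out.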