Flare: flexible in-network allreduce

The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send the aggregated result back to them. However, existing solutions offer limited customization opportunities and may deliver suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To address these problems, in this work we design a flexible programmable switch built around PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
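To make the switch-side aggregation idea concrete, the following is a minimal, self-contained C sketch of the core mechanism described above: each host sends a chunk of values to the switch, the switch accumulates them element-wise, and once every host has contributed, the aggregated chunk is ready to be multicast back. All names (aggregate_packet, chunk_state_t, NUM_HOSTS, CHUNK_ELEMS) are hypothetical and chosen for illustration only; this is not the PsPIN/sPIN handler code evaluated in the paper, and it assumes a non-sparse sum reduction over floats.

/*
 * Illustrative sketch of switch-side allreduce aggregation (sum over
 * floats). Hypothetical names; not the paper's PsPIN implementation.
 */
#include <stdio.h>
#include <string.h>

#define NUM_HOSTS   4    /* hosts participating in the allreduce */
#define CHUNK_ELEMS 8    /* elements per packet payload          */

typedef struct {
    float acc[CHUNK_ELEMS]; /* running element-wise sum            */
    int   arrived;          /* number of host contributions seen   */
} chunk_state_t;

/* Called once per incoming packet payload; returns 1 when the
 * aggregated chunk is complete and could be multicast back. */
static int aggregate_packet(chunk_state_t *st, const float *payload)
{
    for (int i = 0; i < CHUNK_ELEMS; i++)
        st->acc[i] += payload[i];
    return ++st->arrived == NUM_HOSTS;
}

int main(void)
{
    chunk_state_t st;
    memset(&st, 0, sizeof(st));

    /* Simulate each host contributing the vector [h, h, ..., h]. */
    for (int h = 1; h <= NUM_HOSTS; h++) {
        float payload[CHUNK_ELEMS];
        for (int i = 0; i < CHUNK_ELEMS; i++)
            payload[i] = (float)h;

        if (aggregate_packet(&st, payload)) {
            /* A real switch would multicast the result to the hosts here. */
            printf("aggregated chunk:");
            for (int i = 0; i < CHUNK_ELEMS; i++)
                printf(" %.1f", st.acc[i]); /* expected 10.0 per element */
            printf("\n");
        }
    }
    return 0;
}

The sketch deliberately hides what the paper actually studies: how to run this accumulation at line rate on the PsPIN RISC-V cores, how to support custom operators and data types, how to handle sparse inputs, and how to keep the floating-point reduction order fixed when reproducibility matters.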
