Flare: flexible in-network allreduce
[1] Marco Canini, et al. Efficient sparse collective communication and its application to accelerate distributed deep learning, 2020, SIGCOMM.
[2] Jens Domke, et al. Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Valentin Petrov, et al. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation, 2020, ISC.
[4] John F. Canny, et al. Sparse Allreduce: Efficient Scalable Communication for Power-Law Data, 2013, ArXiv.
[5] Jari Nurmi, et al. Flexible Software-Defined Packet Processing Using Low-Area Hardware, 2020, IEEE Access.
[6] Pavan Balaji, et al. On the Reproducibility of MPI Reduction Operations, 2013, 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing.
[7] Torsten Hoefler, et al. Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks, 2014, HPDC '14.
[8] Jacob Nelson, et al. When Should The Network Be The Computer?, 2019, HotOS.
[9] Michael M. Swift, et al. ATP: In-network Aggregation for Multi-tenant Learning, 2021, NSDI.
[10] George Varghese, et al. Design principles for packet parsers, 2013, Architectures for Networking and Communications Systems.
[11] John F. Canny, et al. Kylix: A Sparse Allreduce for Commodity Clusters, 2014, 2014 43rd International Conference on Parallel Processing.
[12] Nicholas J. Wright, et al. Understanding Performance Variability on the Aries Dragonfly Network, 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[13] J. D. C. Little. A Proof for the Queuing Formula: L = λW, 1961, Operations Research.
[14] Luca Benini, et al. FPnew: An Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing, 2020, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[15] Luca Benini, et al. ATUNs: Modular and Scalable Support for Atomic Operations in a Shared Memory Multiprocessor, 2020, 2020 57th ACM/IEEE Design Automation Conference (DAC).
[16] Roberto Bifulco, et al. A Survey on the Programmable Data Plane: Abstractions, Architectures, and Open Problems, 2018, 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR).
[17] George Varghese, et al. P4: programming protocol-independent packet processors, 2013, ACM SIGCOMM Computer Communication Review.
[18] Francisco J. Cazorla, et al. A Quantitative Analysis of OS Noise, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[19] Luca Benini, et al. Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices, 2016, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
[20] Laxmikant V. Kalé, et al. Quantifying Network Contention on Large Parallel Machines, 2009, Parallel Process. Lett.
[21] Nan Jiang, et al. An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives, 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
[22] E. Lorenz, et al. The predictability of a flow which possesses many scales of motion, 1969.
[23] J. P. Grossman, et al. Filtering, Reductions and Synchronization in the Anton 2 Network, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[24] Luca Benini, et al. A RISC-V in-network accelerator for flexible high-performance low-power packet processing, 2021, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).
[25] Torsten Hoefler, et al. Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, 2015, HPDC.
[26] Torsten Hoefler, et al. The PERCS High-Performance Interconnect, 2010, 2010 18th IEEE Symposium on High Performance Interconnects.
[27] Alvin Cheung, et al. Packet Transactions: High-Level Programming for Line-Rate Switches, 2015, SIGCOMM.
[28] Nathalie Revol, et al. Numerical Reproducibility and Parallel Computations: Issues for Interval Algorithms, 2014, IEEE Transactions on Computers.
[29] Jacob Nelson, et al. Evaluating the Power of Flexible Packet Processing for Network Resource Allocation, 2017, NSDI.
[30] Panos Kalnis, et al. Scaling Distributed Machine Learning with In-Network Aggregation, 2019, NSDI.
[31] Torsten Hoefler, et al. An In-Depth Analysis of the Slingshot Interconnect, 2020, ArXiv.
[32] Gottlieb, et al. Hybrid-molecular-dynamics algorithms for the numerical simulation of quantum chromodynamics, 1987, Physical Review D: Particles and Fields.
[33] Xin Yuan, et al. A Study of Process Arrival Patterns for MPI Collective Operations, 2007, ICS '07.
[34] Robert B. Ross, et al. Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System, 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[35] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Jorge Crichigno, et al. An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends, 2021, IEEE Access.
[37] H. von Storch, et al. Limits of reproducibility and hydrodynamic noise in atmospheric regional modelling, 2021, Communications Earth & Environment.
[38] Torsten Hoefler, et al. Mitigating network noise on Dragonfly networks through application-aware routing, 2019, SC.
[39] D. Skinner, et al. Understanding the causes of performance variability in HPC workloads, 2005, Proceedings of the IEEE International Workload Characterization Symposium.
[40] D. Roweth, et al. Cray XC® Series Network, 2012.
[41] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.
[42] Robert B. Ross, et al. Watch Out for the Bully! Job Interference Study on Dragonfly Network, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[43] Dan Alistarh, et al. SparCML: high-performance sparse communication for machine learning, 2018, SC.
[44] Sriram Krishnamoorthy, et al. Effects of floating-point non-associativity on numerical computations on massively multithreaded systems, 2009.
[45] Hong Liu, et al. Energy proportional datacenter networks, 2010, ISCA.
[46] Michael Menth, et al. A Survey on Data Plane Programming with P4: Fundamentals, Advances, and Applied Research, 2021, ArXiv.
[47] Takayuki Okamoto, et al. The Tofu Interconnect D, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[48] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, ArXiv.
[49] Dan Alistarh, et al. Taming unbalanced training workloads in deep learning with partial collective operations, 2019, PPoPP.
[50] T. Hoefler, et al. Kilometer-Scale Climate Models: Prospects and Challenges, 2020.
[51] Kevin Harms, et al. Run-to-run Variability on Xeon Phi based Cray XC Systems, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[52] Torsten Hoefler, et al. Characterizing the Influence of System Noise on Large-Scale Applications by Simulation, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[53] Torsten Hoefler, et al. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, 2018.
[54] Abhinav Bhatele, et al. Evaluation of an Interference-free Node Allocation Policy on Fat-tree Clusters, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[55] Sam Ade Jacobs, et al. Communication Quantization for Data-Parallel Training of Deep Neural Networks, 2016, 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).
[56] Luca Benini, et al. PULP: A parallel ultra low power platform for next generation IoT applications, 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).
[57] Torsten Hoefler, et al. The impact of network noise at large-scale communication performance, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[58] Torsten Hoefler, et al. Evaluating the Cost of Atomic Operations on Modern Architectures, 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).
[59] Dan Alistarh, et al. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks, 2021, J. Mach. Learn. Res.
[60] Shuo Liu, et al. NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration, 2020, ArXiv.
[61] Andrew Waterman, et al. The RISC-V Instruction Set Manual, Volume 1: User-Level ISA, Version 2.0, 2014.
[62] Charles E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing, 1985, IEEE Transactions on Computers.
[63] Mike Dubman, et al. Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction, 2016, 2016 First International Workshop on Communication Optimizations in HPC (COMHPC).
[64] George Varghese, et al. Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN, 2013, SIGCOMM.
[65] Torsten Hoefler, et al. Towards Efficient MapReduce Using MPI, 2009, PVM/MPI.
[66] Song Han, et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.
[67] Bruce Jacob, et al. The structural simulation toolkit, 2006, SIGMETRICS Perform. Eval. Rev.
[68] Luca Benini, et al. Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra, 2020, 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[69] Message Passing Interface Forum. MPI: A message-passing interface standard, 1994.
[70] Prabhat, et al. Exascale Deep Learning for Climate Analytics, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[71] David Defour, et al. Numerical reproducibility for the parallel reduction on multi- and many-core architectures, 2015, Parallel Comput.
[72] Torsten Hoefler, et al. sPIN: High-performance streaming Processing in the Network, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[73] Vincent Vanhoucke, et al. Improving the speed of neural networks on CPUs, 2011.
[74] Paolo Costa, et al. In-network Aggregation for Shared Machine Learning Clusters, 2021, MLSys.
[75] Kevin Harms, et al. Characterization of MPI Usage on a Production Supercomputer, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[76] Salvatore Pontarelli, et al. FlowBlaze: Stateful Packet Processing in Hardware, 2019, NSDI.
[77] Nick Feamster, et al. The road to SDN: an intellectual history of programmable networks, 2014, ACM SIGCOMM Computer Communication Review.