Peta-Scale Embedded Photonics Architecture for Distributed Deep Learning Applications

As Deep Learning (DL) models grow larger and more complex, training jobs are increasingly distributed across multiple Computing Units (CUs) such as GPUs and TPUs. Each CU processes a portion of the model and synchronizes its results with the others, and communication among these CUs has emerged as a key bottleneck in the training process. In this work, we present SiPAC, a Silicon Photonic Accelerated Compute cluster. SiPAC accelerates distributed DL training through two co-designed components: a photonic physical layer and a novel collective-communication algorithm. The physical layer exploits embedded photonics to bring peta-scale I/O directly to the CUs of a DL-optimized cluster and uses resonator-based optical wavelength selectivity to realize hardware multicasting. The collective algorithm builds on this hardware multicasting primitive. Together, these components expedite a variety of collective communications commonly employed in DL training and have the potential to drastically ease the communication bottleneck. We demonstrate the feasibility of the SiPAC architecture through 1) an optical testbed experiment in which an array of comb laser wavelengths is shuffled by a cascaded ring switch, with each ring selecting and forwarding multiple wavelengths to increase the effective communication bandwidth, thereby demonstrating the hardware multicasting primitive, and 2) a four-GPU testbed running a realistic DL workload that achieves a 22% system-level performance improvement relative to a similarly sized leaf-spine topology. Large-scale simulations show that SiPAC achieves a 1.4× to 5.9× communication time reduction compared to state-of-the-art compute clusters for representative collective communications.
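For intuition on why a hardware multicasting primitive can shorten collective communications, the sketch below is a minimal, first-order cost model, not the published SiPAC algorithm: the function names, bandwidth figures, and formulas are illustrative assumptions. It compares a conventional ring AllGather, which serializes N-1 shard exchanges, against a one-shot multicast AllGather in which each CU broadcasts its shard once over its own wavelength group and every peer filters out the wavelengths it needs.

```python
# Illustrative cost model only: parameters and formulas are assumptions,
# not the SiPAC collective algorithm itself.

def ring_allgather_time(num_cus: int, shard_bytes: int, link_bw_gbps: float) -> float:
    """Ring AllGather: N-1 sequential steps, each moving one shard per link."""
    steps = num_cus - 1
    step_time = shard_bytes * 8 / (link_bw_gbps * 1e9)  # seconds per step
    return steps * step_time

def multicast_allgather_time(num_cus: int, shard_bytes: int,
                             per_wavelength_gbps: float,
                             wavelengths_per_cu: int) -> float:
    """Multicast AllGather: each CU broadcasts its shard once on its own
    wavelength group; peers receive all shards in a single overlapped step."""
    broadcast_bw_bps = per_wavelength_gbps * wavelengths_per_cu * 1e9
    return shard_bytes * 8 / broadcast_bw_bps

if __name__ == "__main__":
    n, shard = 8, 256 * 2**20  # 8 CUs, 256 MiB gradient shard each (assumed)
    ring = ring_allgather_time(n, shard, link_bw_gbps=400)
    mcast = multicast_allgather_time(n, shard, per_wavelength_gbps=100,
                                     wavelengths_per_cu=8)
    print(f"ring AllGather:      {ring * 1e3:.2f} ms")
    print(f"multicast AllGather: {mcast * 1e3:.2f} ms")
```

Under this toy model, the multicast primitive removes the (N-1)-step serialization of the ring schedule, which is the qualitative effect the SiPAC co-design targets at the hardware level; the actual gains reported above come from the testbed and simulations, not from this model.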
