Specializing the network for scatter-gather workloads

Data processing and distributed querying workloads often follow a "scatter-gather" or "partition-aggregate" architectural pattern, in which one application queries hundreds or even thousands of workers. Network communication is often a bottleneck in this pattern, especially when the compute task at each worker is small, as in Web queries and interactive analytics. The network bottleneck can lower throughput, raise CPU utilization, and increase job completion time by orders of magnitude. To overcome these inefficiencies, we explore hardware offload of the scatter-gather primitive, whereby a smart NIC takes on the responsibility of sending out queries and collecting responses. We show that this approach not only virtually eliminates CPU usage but, with suitable scheduling of responses, also speeds up scatter by allowing parallel queries and speeds up gather by preventing throughput collapse due to excessive congestion. In addition to response scheduling, we use a careful design at the NIC to limit FPGA resource usage: our approach uses about 25% of on-chip logic and 33% of on-chip memory on a mid-sized FPGA, leaving enough room to implement other functions on the smart NIC.
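
As a rough software analogy for the pattern described above (not the NIC design itself, which the abstract does not detail), the following Python sketch scatters one query to many workers in parallel and caps how many responses may be in transit at once. The names `scatter_gather`, `worker`, `run_demo`, and `max_in_flight` are hypothetical, and the semaphore is only an assumed stand-in for response scheduling.

```python
import asyncio


async def worker(worker_id, query, response_slots, stats):
    # Small compute task at the worker (illustrative placeholder).
    result = f"{query}:{worker_id}"
    # Assumed response scheduling: a worker transmits only while holding a slot,
    # so at most max_in_flight responses converge on the aggregator at once.
    async with response_slots:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0)  # stand-in for sending the response
        stats["in_flight"] -= 1
    return result


async def scatter_gather(query, num_workers, max_in_flight):
    response_slots = asyncio.Semaphore(max_in_flight)
    stats = {"in_flight": 0, "peak": 0}
    # Scatter: issue all queries concurrently rather than one after another.
    tasks = [worker(i, query, response_slots, stats) for i in range(num_workers)]
    # Gather: collect responses in worker order.
    responses = await asyncio.gather(*tasks)
    return responses, stats["peak"]


def run_demo(num_workers=100, max_in_flight=8):
    return asyncio.run(scatter_gather("q", num_workers, max_in_flight))
```

Calling `run_demo(100, 8)` returns all 100 responses together with the peak number of responses that were in transit simultaneously, which never exceeds the configured limit. In a real deployment the limit would depend on link speed and switch buffer size, neither of which is specified here.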
