Optimus Prime: Accelerating Data Transformation in Servers

Modern online services are shifting away from monolithic applications to loosely-coupled microservices because of their improved scalability, reliability, programmability, and development velocity. Microservices communicating over the datacenter network require data transformation (DT) to convert messages back and forth between their internal formats. This work identifies DT as a bottleneck that emerges because the latency of the surrounding system components, namely application runtimes, protocol stacks, and network hardware, has been reduced. We therefore propose Optimus Prime (OP), a programmable DT accelerator that uses a novel abstraction, an in-memory schema, to represent DT operations. The schema is compatible with today's DT frameworks and enables any compliant accelerator to perform the transformations comprising a request in parallel. Our evaluation shows that OP's DT throughput matches the line rate of today's NICs and is ~60x higher than that of software, at a tiny fraction of the CPU's silicon area and power. We also evaluate a set of microservices running on Thrift and show up to a 30% reduction in service latency.
