Building a Framework for High-performance In-memory Message-Oriented Middleware

Message-Oriented Middleware (MOM) is a popular class of software used in many distributed applications, ranging from business systems and social networks to gaming and streaming media services. As workloads continue to grow in both the number of users and the amount of content, modern MOM systems face increasing demands for performance and scalability. Recent advances in networking, such as Remote Direct Memory Access (RDMA), offer a more efficient data transfer mechanism than the traditional kernel-level socket networking used by existing, widely used MOM systems. Unfortunately, RDMA’s complex interface has made it difficult for MOM systems to utilize its capabilities. In this thesis, we introduce a framework called RocketBufs, which provides abstractions and interfaces for constructing high-performance MOM systems. Applications implemented using RocketBufs produce and consume data using regions of memory called buffers, while the framework is responsible for transmitting, receiving and synchronizing buffer access. RocketBufs’ buffer abstraction is designed to work efficiently with different transport protocols, allowing messages to be distributed over RDMA or TCP through the same APIs (i.e., by simply changing a configuration file). We demonstrate the utility and evaluate the performance of RocketBufs by using it to implement a publish/subscribe system called RBMQ. We compare RBMQ against two widely used, industry-grade MOM systems, namely RabbitMQ and Redis. Our evaluations show that when using TCP, RBMQ achieves up to 1.9 times higher messaging throughput than RabbitMQ, a message queuing system with an equivalent flow control scheme. When RDMA is used, RBMQ shows significant gains in messaging throughput (up to 3.7 times higher than RabbitMQ and up to 1.7 times higher than Redis), as well as reductions in median delivery latency (up to 81% lower than RabbitMQ and 47% lower than Redis).
In addition, on RBMQ subscriber hosts configured to use RDMA, data transfers occur with negligible CPU overhead regardless of the amount of data being transferred. This frees CPU resources for other purposes, such as processing data. To further demonstrate the flexibility of RocketBufs, we use it to build a live streaming video application by integrating RocketBufs into a web server to receive disseminated video data. When compared with the same application built with Redis, the RocketBufs-based dissemination host achieves up to 73% higher live streaming throughput while disseminating data, and the RocketBufs-based web server shows a reduction of up to 95% in CPU utilization, allowing up to 55% more concurrent viewers to be serviced.
