EQueue: Elastic Lock-Free FIFO Queue for Core-to-Core Communication on Multi-Core Processors

In recent years, the number of CPU cores in multi-core processors has kept increasing. To leverage this growing hardware resource, programmers need to parallelize their software. One promising approach to parallelizing high-performance applications is pipeline parallelism, which divides a task into a series of subtasks and maps them onto a group of CPU cores; the communication scheme between subtasks running on different cores thus becomes a critical component of the parallelized program. A widely used implementation of this scheme is a software-based, lock-free first-in-first-out (FIFO) queue that moves data between subtasks. The primary design goal of prior lock-free queues was high throughput, so their enqueue and dequeue operations rely heavily on batching. Unfortunately, a lock-free queue with batching depends on the assumption that data arrive at a constant rate and that the queue stays in an equilibrium state. We found experimentally that this equilibrium rarely holds in real high-performance use cases (e.g., 10 Gbps+ network applications) because the data arrival rate fluctuates sharply. As a result, existing queues suffer performance degradation in real applications on multi-core processors. In this paper, we present EQueue, a lock-free, efficient, and robust queue that addresses this robustness issue. EQueue adaptively (1) shrinks its queue size when the data arrival rate is low, keeping its memory footprint small to better utilize the CPU cache, and (2) enlarges its queue size to avoid overflow when the data arrival rate is bursty. Experimental results show that in high-performance applications EQueue always completes an enqueue/dequeue operation in fewer than 50 CPU cycles, outperforming FastForward and MCRingBuffer, two state-of-the-art queues, by factors of 3 and 2, respectively.

[1] Xinan Tang, et al. B-Queue: Efficient and Practical Queuing for Fast Core-to-Core Communication, 2012, International Journal of Parallel Programming.

[2] Amin Vahdat, et al. Less Is More: Trading a Little Bandwidth for Ultra-Low Latency in the Data Center, 2012, NSDI.

[3] Sangjin Han, et al. PacketShader: a GPU-accelerated software router, 2010, SIGCOMM '10.

[4] Maged M. Michael, et al. Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors, 1998, J. Parallel Distributed Comput.

[5] Yi Zhang, et al. A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems, 2001, SPAA '01.

[6] Erez Petrank, et al. Wait-free queues with multiple enqueuers and dequeuers, 2011, PPoPP '11.

[7] Timothy M. Jones, et al. Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication, 2016, ICS.

[8] Maurice Herlihy, et al. Linearizability: a correctness condition for concurrent objects, 1990, TOPL.

[9] Tao Li, et al. A Flexible Communication Mechanism for Pipeline Parallelism, 2017, 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA/IUCC).

[10] Leslie Lamport, et al. Specifying Concurrent Program Modules, 1983, TOPL.

[11] Nir Shavit, et al. An optimistic approach to lock-free FIFO queues, 2004, Distributed Computing.

[12] Samuel P. Midkiff, et al. Expressing and exploiting concurrency in networked applications with aspen, 2007, PPoPP.

[13] Mark Moir, et al. Using elimination to implement scalable and lock-free FIFO queues, 2005, SPAA '05.

[14] Xinan Tang, et al. Practice of parallelizing network applications on multi-core architectures, 2009, ICS '09.

[15] Gurindar S. Sohi, et al. Serialization sets: a dynamic dependence-based parallel execution model, 2009, PPoPP '09.

[16] Christopher J. Hughes, et al. Carbon: architectural support for fine-grained parallelism on chip multiprocessors, 2007, ISCA '07.

[17] David I. August, et al. Liberty Queues for EPIC Architectures, 2010.

[18] John Giacomoni, et al. Visualizing potential parallelism in sequential programs, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[19] John Giacomoni, et al. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue, 2008, PPoPP.

[20] Massimo Torquati, et al. Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems, 2010, arXiv.

[21] Yehuda Afek, et al. Fast concurrent queues for x86 processors, 2013, PPoPP '13.

[22] Christof Fetzer, et al. FFQ: A Fast Single-Producer/Multiple-Consumer Concurrent FIFO Queue, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[23] Patrick P. C. Lee, et al. A lock-free, cache-efficient multi-core synchronization mechanism for line-rate network traffic monitoring, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[24] Maged M. Michael, et al. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms, 1996, PODC '96.

[25] Alexander L. Wolf, et al. Frame shared memory: line-rate networking on commodity hardware, 2007, ANCS '07.

[26] David A. Patterson, et al. Computer Architecture: A Quantitative Approach, 1990.

[27] Bertil Folliot, et al. BatchQueue: Fast and Memory-Thrifty Core to Core Communication, 2010, 2010 22nd International Symposium on Computer Architecture and High Performance Computing.

[28] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities, 1967, AFIPS '67 (Spring).