A Dynamic Multi-Threaded Queuing Mechanism for Reducing the Inter-Process Communication Latency on Multi-Core Chips

Reducing latency in inter-process/inter-thread communication is one of the key challenges in parallel and distributed computing: as the number of threads in an application increases, so does the communication overhead, and background load increases the latency further. Reducing communication latency can therefore have a significant impact on the performance of multi-threaded applications in multi-core environments. In the wide range of applications that use a queueing mechanism, inter-process/inter-thread communication typically involves enqueuing and dequeuing. This paper presents a queueing technique called eLCRQ, a lock-free, block-when-necessary, multi-producer multi-consumer (MPMC) FIFO queue. It is designed for scenarios where the queue can become empty randomly and frequently at runtime. By combining the performance of lock-free operation with the resource efficiency of blocking, it delivers improved performance. Specifically, on multi-core Linux-based systems it achieves a 1.7X reduction in latency and a 2.3X reduction in CPU usage compared to existing message-passing mechanisms, including pipes and sockets. In low-load scenarios, the proposed scheme also provides a 3.4X decrease in CPU usage while maintaining comparable latency when compared to other lock-free MPMC queues. Our work is based on open-source Linux and support libraries.
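The block-when-necessary idea can be illustrated with a minimal Python sketch. This is not the eLCRQ algorithm: the class name `BlockWhenNecessaryQueue`, the `spin_limit` parameter, and the waiter counter are all hypothetical, and CPython's thread-safe `collections.deque` merely stands in for a true lock-free queue. The sketch only shows the general pattern of polling without blocking on the fast path and sleeping only once the queue stays empty.

```python
import threading
from collections import deque


class BlockWhenNecessaryQueue:
    """Illustrative MPMC FIFO: non-blocking fast path, sleep only when empty."""

    def __init__(self, spin_limit=100):
        self._items = deque()          # append/popleft are thread-safe in CPython
        self._cond = threading.Condition()
        self._waiters = 0              # consumers asleep or about to sleep
        self._spin_limit = spin_limit  # hypothetical tuning knob

    def enqueue(self, item):
        self._items.append(item)       # fast path: no explicit lock taken
        if self._waiters > 0:          # pay for a wake-up only if someone sleeps
            with self._cond:
                self._cond.notify()

    def dequeue(self):
        # Fast path: poll a bounded number of times without blocking.
        for _ in range(self._spin_limit):
            try:
                return self._items.popleft()
            except IndexError:
                pass
        # Slow path: register as a waiter, re-check, then sleep until notified.
        with self._cond:
            self._waiters += 1
            try:
                while True:
                    try:
                        return self._items.popleft()
                    except IndexError:
                        self._cond.wait()
            finally:
                self._waiters -= 1


q = BlockWhenNecessaryQueue()
q.enqueue("a")
q.enqueue("b")
first = q.dequeue()   # "a": items leave in FIFO order
```

Registering as a waiter before the final emptiness check is what prevents a lost wake-up in this sketch: a producer that enqueues after that check is guaranteed to see a non-zero waiter count and issue a notification. A consumer that finds work during the polling phase never touches the condition variable, which is where the CPU-versus-latency trade-off described above comes from.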
