FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures whose memory consistency models range from weakly to strongly ordered. Enqueue and dequeue times on a 2.66 GHz Opteron 2218-based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general-purpose commodity hardware.
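
As a rough illustration of what a cache-conscious single-producer/single-consumer queue can look like, the following C11 sketch lets each slot signal its own state (NULL meaning empty), so that the producer and consumer never read each other's index. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `ff_queue`, `ff_init`, `ff_enqueue`, `ff_dequeue`, the capacity `FF_SLOTS`, and the 64-byte cache-line size are all hypothetical choices made here for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FF_SLOTS 1024          /* hypothetical capacity */

typedef struct {
    /* Each slot holds a pointer; NULL marks the slot as empty. */
    _Atomic(void *) slot[FF_SLOTS];
    /* Indices are private to one thread each and placed on separate
       cache lines (64 bytes assumed) so they are never shared. */
    _Alignas(64) size_t head;  /* touched only by the producer */
    _Alignas(64) size_t tail;  /* touched only by the consumer */
} ff_queue;

static void ff_init(ff_queue *q)
{
    assert(q != NULL);
    for (size_t i = 0; i < FF_SLOTS; i++)
        atomic_init(&q->slot[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Producer side. Returns 0 on success, -1 if the queue is full.
   NULL items are not allowed because NULL encodes "empty". */
static int ff_enqueue(ff_queue *q, void *item)
{
    assert(item != NULL);
    if (atomic_load_explicit(&q->slot[q->head], memory_order_acquire) != NULL)
        return -1;             /* next slot not yet consumed: full */
    /* Release ordering publishes the item's contents with the pointer. */
    atomic_store_explicit(&q->slot[q->head], item, memory_order_release);
    q->head = (q->head + 1) % FF_SLOTS;
    return 0;
}

/* Consumer side. Returns the oldest item, or NULL if the queue is empty. */
static void *ff_dequeue(ff_queue *q)
{
    void *item = atomic_load_explicit(&q->slot[q->tail], memory_order_acquire);
    if (item == NULL)
        return NULL;           /* nothing produced yet: empty */
    /* Writing NULL hands the slot back to the producer. */
    atomic_store_explicit(&q->slot[q->tail], NULL, memory_order_release);
    q->tail = (q->tail + 1) % FF_SLOTS;
    return item;
}
```

In this sketch the only memory shared between the two threads is the slot array itself; full and empty conditions are detected from the slot contents rather than by comparing indices. The acquire/release orderings are one conservative choice for portability across weakly ordered machines, and a real implementation would tune them per architecture.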
