Combining Funnels: A Dynamic Approach to Software Combining

We enhance the well-established software combining synchronization technique to create combining funnels. Previous software combining methods used a statically assigned tree whose depth was logarithmic in the total number of processors in the system. On shared-memory multiprocessors, the new method dynamically builds combining trees whose depth is logarithmic in the number of processors actually accessing the data structure concurrently. The structure is composed of a series of combining layers through which processors' requests are funneled. These layers use randomization, rather than a rigid tree structure, to let processors find partners for combining. An adaptive scheme lets the funnel change width and depth to accommodate different access frequencies without requiring global agreement on its size: processors choose the protocol's parameters privately, which makes the scheme simple to implement and tune. Adding an “elimination” mechanism to the funnel structure transforms the randomly constructed “tree” into a “forest” of disjoint (and on average shallower) trees of requests, increasing parallelism and decreasing latency. We present two new linearizable combining-funnel-based data structures: a fetch-and-add object and a stack. We evaluate these structures by benchmarking them against the most efficient software implementations of fetch-and-add and stacks known to date, combining trees and elimination trees, on a shared-memory multiprocessor simulated with Proteus. Our empirical data shows that combining-funnel-based fetch-and-add outperforms combining trees of fixed height by as much as 70%. In fact, even compared to combining trees optimized for a given load, funnel performance is the same or better. Elimination trees, which are not linearizable, are 10% faster than funnels under the highest load, but as load drops, combining funnels adapt their size, giving them a 34% advantage in latency.
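To make the combining-layer idea concrete, here is a minimal sketch of a fetch-and-add counter with a single randomized combining layer. It is an illustration of the general technique only, not the paper's implementation: the class and field names are hypothetical, the real funnel stacks several layers and adapts their width, and the spin bound here is an arbitrary constant. A thread advertises its request in a random slot; a second thread that lands on the same slot captures the request, applies both amounts to the central counter in one atomic operation, and hands the waiting thread its result.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Simplified single-layer sketch; the paper's funnel uses multiple
// adaptive layers chosen privately by each processor.
class CombiningFunnelCounter {
    private static final class Request {
        final int amount;
        volatile boolean done;   // set by a combiner after delivering result
        volatile int result;
        Request(int amount) { this.amount = amount; }
    }

    private final AtomicInteger counter = new AtomicInteger();
    private final AtomicReferenceArray<Request> layer;
    private static final int SPIN = 64;  // arbitrary waiting bound

    CombiningFunnelCounter(int width) {
        layer = new AtomicReferenceArray<>(width);
    }

    int fetchAndAdd(int amount) {
        Request mine = new Request(amount);
        int slot = ThreadLocalRandom.current().nextInt(layer.length());
        Request other = layer.get(slot);
        if (other != null && layer.compareAndSet(slot, other, null)) {
            // Captured a waiting partner: apply both amounts in one
            // atomic step, then distribute the pre-increment values.
            int base = counter.getAndAdd(amount + other.amount);
            other.result = base;          // partner serialized first
            other.done = true;            // release the waiting thread
            return base + other.amount;   // our own pre-increment value
        }
        if (layer.compareAndSet(slot, null, mine)) {
            // Advertise our request and wait briefly for a combiner.
            for (int i = 0; i < SPIN && !mine.done; i++) Thread.onSpinWait();
            if (!layer.compareAndSet(slot, mine, null)) {
                // A combiner captured us; it will deliver our result.
                while (!mine.done) Thread.onSpinWait();
                return mine.result;
            }
            // Timed out unpaired: withdraw and apply directly below.
        }
        return counter.getAndAdd(amount);
    }

    int get() { return counter.get(); }
}
```

Because a combined pair performs a single `getAndAdd` on the central counter, contention on the shared location drops as pairing succeeds more often; uncombined requests simply fall through to the counter, so correctness never depends on finding a partner.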
