Waiting algorithms for synchronization in large-scale multiprocessors

Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread first waits by polling a synchronization variable. If the cost of polling reaches a limit $L_{poll}$ and further waiting is necessary, the thread blocks, incurring an additional fixed cost $B$. The choice of $L_{poll}$ is a critical determinant of the performance of two-phase algorithms. We focus on methods for statically determining $L_{poll}$ because the run-time overhead of dynamically determining it can be comparable to the cost of blocking in large-scale multiprocessor systems with lightweight threads. Our experiments show that always-block ($L_{poll} = 0$) is a good waiting algorithm, with performance usually close to the best of the algorithms compared. We show that even better performance can be achieved with a static choice of $L_{poll}$ based on knowledge of likely wait-time distributions. Motivated by the observation that different synchronization types exhibit different wait-time distributions, we prove that a static choice of $L_{poll}$ can yield close to optimal on-line performance against an adversary that is restricted to choosing wait times from a fixed family of probability distributions. This result allows us to make an optimal static choice of $L_{poll}$ based on synchronization type. For exponentially distributed wait times, we prove that setting $L_{poll} = \ln(e-1)\,B$ results in a waiting cost no more than $e/(e-1)$ times the cost of an optimal off-line algorithm. For uniformly distributed wait times, we prove that setting $L_{poll} = \frac{1}{2}(\sqrt{5}-1)\,B$ results in a waiting cost no more than $(\sqrt{5}+1)/2$ (the golden ratio) times the cost of an optimal off-line algorithm. Experimental measurements of several parallel applications on the Alewife multiprocessor simulator corroborate our theoretical findings.
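As a concrete illustration of the scheme the abstract describes, the following is a minimal C sketch of a two-phase wait, assuming unit polling cost per probe. The constant B, the block_on() primitive, and the function names are hypothetical conveniences for this sketch; they are not taken from the paper or from the Alewife runtime.

```c
/* A minimal sketch of a two-phase waiting algorithm, assuming unit
 * polling cost per probe. B, block_on(), and the poll-limit helpers
 * are illustrative assumptions, not the paper's implementation. */
#include <stdatomic.h>
#include <math.h>

#define B 1000.0  /* fixed cost of blocking, in polling-cost units */

/* Static poll limits suggested by the paper's bounds:
 * exponential wait times: L_poll = ln(e - 1) * B,
 *   waiting cost within e/(e - 1) of an optimal off-line algorithm;
 * uniform wait times: L_poll = (sqrt(5) - 1)/2 * B,
 *   waiting cost within the golden ratio of optimal. */
static double lpoll_exponential(void) { return log(exp(1.0) - 1.0) * B; }
static double lpoll_uniform(void)     { return (sqrt(5.0) - 1.0) / 2.0 * B; }

/* Hypothetical blocking primitive: deschedule the thread until the
 * synchronization variable is set. Stubbed as a spin here so the
 * sketch stays self-contained. */
static void block_on(atomic_bool *ready)
{
    while (!atomic_load_explicit(ready, memory_order_acquire))
        ;
}

/* Phase one: poll until the variable is set or the accumulated
 * polling cost reaches lpoll. Phase two: block, paying the fixed
 * additional cost B. */
void two_phase_wait(atomic_bool *ready, double lpoll)
{
    for (double cost = 0.0; cost < lpoll; cost += 1.0) {
        if (atomic_load_explicit(ready, memory_order_acquire))
            return;  /* synchronization arrived during the polling phase */
    }
    block_on(ready);  /* further waiting needed: give up the processor */
}
```

Note that passing lpoll = 0 degenerates to the always-block algorithm the abstract evaluates, while an unbounded lpoll degenerates to pure spinning; the static limits above sit between the two extremes according to the assumed wait-time distribution.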
