Speculative lock elision: enabling highly concurrent multithreaded execution

Serialization of threads due to critical sections is a fundamental bottleneck to achieving high performance in multithreaded programs. Dynamically, such serialization may be unnecessary because these critical sections could have safely executed concurrently without locks. Current processors cannot fully exploit such parallelism because they do not have mechanisms to dynamically detect such false inter-thread dependences.

We propose Speculative Lock Elision (SLE), a novel micro-architectural technique to remove dynamically unnecessary lock-induced serialization and enable highly concurrent multithreaded execution. The key insight is that locks do not always have to be acquired for a correct execution. Synchronization instructions are predicted as being unnecessary and elided. This allows multiple threads to concurrently execute critical sections protected by the same lock. Misspeculation due to inter-thread data conflicts is detected using existing cache mechanisms and rollback is used for recovery. Successful speculative elision is validated and committed without acquiring the lock.

SLE can be implemented entirely in microarchitecture without instruction set support and without system-level modifications, is transparent to programmers, and requires only trivial additional hardware support. SLE can provide programmers a fast path to writing correct high-performance multithreaded programs.
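The elide, detect, roll back, and commit steps described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the paper proposes: SLE itself lives in the microarchitecture and relies on cache coherence, whereas here the class `SpeculativeSection`, its `snoop_write` method, and the helpers `increment` and `run_pair` are hypothetical names invented for illustration. The model is deterministic and single-threaded, it notifies other sections of writes only at commit time, and it ignores remote reads of buffered data.

```python
class SpeculativeSection:
    """Toy model of one thread speculatively running a critical section."""

    def __init__(self, memory, lock_addr):
        self.memory = memory
        # The elided lock is only read, never written, so it stays "free"
        # and other threads can enter the same critical section.
        self.read_set = {lock_addr}
        self.write_buffer = {}   # speculative stores, invisible to others
        self.aborted = False

    def read(self, addr):
        if addr in self.write_buffer:      # forward our own buffered store
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def snoop_write(self, addr):
        # Assumption: stands in for a coherence invalidation from another
        # thread. Touching anything we read or wrote is a data conflict.
        if addr in self.read_set or addr in self.write_buffer:
            self.aborted = True

    def commit(self, others):
        if self.aborted:
            return False                   # rollback: buffered stores are dropped
        for addr, value in self.write_buffer.items():
            self.memory[addr] = value      # make speculative stores visible
            for other in others:
                other.snoop_write(addr)
        return True


def increment(section, addr):
    """Critical-section body: a read-modify-write of one shared location."""
    section.write(addr, section.read(addr) + 1)


def run_pair(addr1, addr2):
    """Run two critical sections guarded by the same lock, both elided."""
    memory = {"lock": 0, "x": 0, "y": 0}
    t1 = SpeculativeSection(memory, "lock")
    t2 = SpeculativeSection(memory, "lock")
    increment(t1, addr1)
    increment(t2, addr2)
    ok1 = t1.commit([t2])
    ok2 = t2.commit([t1])
    if not ok2:
        # Misspeculation: redo the section while really holding the lock.
        # No other section is speculating here, so nothing needs to snoop
        # this write to the lock.
        memory["lock"] = 1
        memory[addr2] += 1
        memory["lock"] = 0
    return ok1, ok2, memory


print(run_pair("x", "y"))   # disjoint data: both commit without the lock
print(run_pair("x", "x"))   # same data: second section falls back to the lock
```

In the first call the two sections touch different locations, so both commit and the lock word is never written. In the second call the committed store to `x` hits the other section's read set, which discards its buffered store and reruns under the lock. A write to the lock itself would abort every speculating section in the same way, because the lock address sits in each read set.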
