Complementing user-level coarse-grain parallelism with implicit speculative parallelism

Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort versus performance gain. Some parallel applications show limited or no scaling beyond a number of cores. Given the abundant number of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This paper proposes using cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocol that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. Implicit speculative parallelism frees the programmer from the additional effort to explicitly partition the work into finer and properly synchronized tasks. Our results show that, for a many-core comprising of 128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 41% on average for the 4-core cluster and by 27% on average for the 2-core cluster. These performance improvements come with an energy consumption that is close to - and sometimes better than - the baseline. This approach often leads to better performance and energy efficiency compared to existing alternatives such as Core Fusion and Frequency Boosting. We also investigate the tradeoffs between explicit and implicit threads as input dataset sizes vary. Finally, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of the static oracle selection of threads.

[1]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[2]  Meeta Sharma Gupta,et al.  System level analysis of fast, per-core DVFS using on-chip switching regulators , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[3]  Gurindar S. Sohi,et al.  Speculative versioning cache , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[4]  Kunle Olukotun,et al.  A Scalable, Non-blocking Approach to Transactional Memory , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[5]  Kunle Olukotun,et al.  An effective hybrid transactional memory system with strong isolation guarantees , 2007, ISCA '07.

[6]  James E. Smith,et al.  Managing multi-configuration hardware via dynamic working set analysis , 2002, ISCA.

[7]  Margaret Martonosi,et al.  Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[8]  G. Amdhal,et al.  Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).

[9]  Marcelo Cintra,et al.  Handling branches in TLS systems with Multi-Path Execution , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[10]  Feng Liu,et al.  Scalable Speculative Parallelization on Commodity Clusters , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[11]  Chen Ding,et al.  Locality phase prediction , 2004, ASPLOS XI.

[12]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[13]  Milind Girkar,et al.  On the performance potential of different types of speculative thread-level parallelism: The DL version of this paper includes corrections that were not made available in the printed proceedings , 2006, ICS '06.

[14]  Rudolf Eigenmann,et al.  Min-cut program decomposition for thread-level speculation , 2004, PLDI '04.

[15]  David A. Wood,et al.  Supporting nested transactional memory in logTM , 2006, ASPLOS XII.

[16]  Sandhya Dwarkadas,et al.  Characterizing and predicting program behavior and its variability , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[17]  Mitsuhisa Sato,et al.  Performance Evaluation of the Omni OpenMP Compiler , 2000, ISHPC.

[18]  Michael L. Scott,et al.  Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor , 2003, ISCA '03.

[19]  Bronis R. de Supinski,et al.  Adagio: making DVS practical for complex HPC applications , 2009, ICS.

[20]  Todd C. Mowry,et al.  The potential for using thread-level data speculation to facilitate automatic parallelization , 1998, Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture.

[21]  Wei Liu,et al.  POSH: a TLS compiler that exploits program structure , 2006, PPoPP '06.

[22]  Antonia Zhai,et al.  Compiler optimization of scalar value communication between speculative threads , 2002, ASPLOS X.

[23]  Margaret Martonosi,et al.  Multipath execution: opportunities and limits , 1998, ICS '98.

[24]  Margaret Martonosi,et al.  Dynamic thermal management for high-performance microprocessors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[25]  Dean M. Tullsen,et al.  Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[26]  Michael C. Huang,et al.  Positional adaptation of processors: application to energy reduction , 2003, ISCA '03.

[27]  Nikolas Ioannou,et al.  Increasing the energy efficiency of TLS systems using intermediate checkpointing , 2011, 2011 18th International Conference on High Performance Computing.

[28]  Josep Torrellas,et al.  Speculative synchronization: applying thread-level speculation to explicitly parallel applications , 2002, ASPLOS X.

[29]  D.K. Lowenthal,et al.  Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[30]  Pedro López,et al.  Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[31]  José González,et al.  Meeting points: Using thread criticality to adapt multicore hardware to parallel regions , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32]  Tajana Simunic,et al.  Dynamic voltage frequency scaling for multi-tasking systems using online learning , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[33]  Gu-Yeon Wei,et al.  Thread motion: fine-grained power management for multi-core systems , 2009, ISCA '09.

[34]  Timothy Mattson,et al.  A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[35]  Nikolas Ioannou,et al.  Combining thread level speculation helper threads and runahead execution , 2009, ICS.

[36]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[37]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008, Computer.

[38]  Dimitrios S. Nikolopoulos,et al.  Online power-performance adaptation of multithreaded programs using hardware event-based prediction , 2006, ICS '06.

[39]  John L. Gustafson,et al.  Reevaluating Amdahl's law , 1988, CACM.

[40]  Margaret Martonosi,et al.  An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[41]  Yun Zhang,et al.  Decoupled software pipelining creates parallelization opportunities , 2010, CGO '10.

[42]  Josep Torrellas,et al.  Tradeoffs in buffering memory state for thread-level speculation in multiprocessors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[43]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[44]  Gurindar S. Sohi,et al.  ARB: A Hardware Mechanism for Dynamic Reordering of Memory References , 1996, IEEE Trans. Computers.

[45]  Josep Torrellas,et al.  Architectural support for scalable speculative parallelization in shared-memory multiprocessors , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[46]  Allan Gottlieb,et al.  Highly parallel computing , 1989, Benjamin/Cummings Series in computer science and engineering.

[47]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[48]  Robert D. Blumofe,et al.  Scheduling multithreaded computations by work stealing , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[49]  Easwaran Raman,et al.  Spice: speculative parallel iteration chunk execution , 2008, CGO '08.

[50]  Bronis R. de Supinski,et al.  Prediction models for multi-dimensional power-performance optimization on many cores , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[51]  Pedro López,et al.  Boosting single-thread performance in multi-core systems through fine-grain multi-threading , 2009, ISCA '09.

[52]  Josep Torrellas,et al.  Bulk Disambiguation of Speculative Threads in Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[53]  Dean M. Tullsen,et al.  Mapping Out a Path from Hardware Transactional Memory to Speculative Multithreading , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[54]  Mahmut T. Kandemir,et al.  Exploiting barriers to optimize power consumption of CMPs , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[55]  David A. Wood,et al.  LogTM-SE: Decoupling Hardware Transactional Memory from Caches , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[56]  S. Borkar,et al.  An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS , 2008, IEEE Journal of Solid-State Circuits.

[57]  James R. Goodman,et al.  Speculative lock elision: enabling highly concurrent multithreaded execution , 2001, MICRO.

[58]  David K. Lowenthal,et al.  Using multiple energy gears in MPI programs on a power-scalable cluster , 2005, PPoPP.

[59]  Milo M. K. Martin,et al.  Deconstructing Transactional Semantics: The Subtleties of Atomicity , 2005 .

[60]  Wei Liu,et al.  Tasking with out-of-order spawn in TLS chip multiprocessors: microarchitecture and compilation , 2005, ICS '05.

[61]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[62]  Vivek Sarkar,et al.  Partitioning parallel programs for macro-dataflow , 1986, LFP '86.

[63]  Ryan N. Rakvic,et al.  The Fuzzy Correlation between Code and Performance Predictability , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[64]  Rudolf Eigenmann,et al.  Cetus: A Source-to-Source Compiler Infrastructure for Multicores , 2009, Computer.

[65]  Marc Tremblay,et al.  A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC® Processor , 2008, 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[66]  Michael L. Scott,et al.  Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[67]  Engin Ipek,et al.  Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.

[68]  Michael C. Huang,et al.  The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[69]  Saurabh Dighe,et al.  Within-die variation-aware dynamic-voltage-frequency scaling core mapping and thread hopping for an 80-core processor , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[70]  Steven M. Nowick,et al.  Robust interfaces for mixed-timing systems with application to latency-insensitive protocols , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[71]  Luis Ceze,et al.  Implicit parallelism with ordered transactions , 2007, PPoPP.

[72]  Michael C. Huang,et al.  Dynamically Tuning Processor Resources with Adaptive Processing , 2003, Computer.

[73]  Kunle Olukotun,et al.  The Stanford Hydra CMP , 2000, IEEE Micro.

[74]  J. Gregory Steffan,et al.  Improving cache locality for thread-level speculation , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[75]  Kunle Olukotun,et al.  Data speculation support for a chip multiprocessor , 1998, ASPLOS VIII.

[76]  Marc Tremblay,et al.  Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor , 2009, ISCA '09.

[77]  Yale N. Patt,et al.  Simultaneous subordinate microthreading (SSMT) , 1999, ISCA.

[78]  Gurindar S. Sohi,et al.  Task selection for a multiscalar processor , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[79]  Stijn Eyerman,et al.  Modeling critical sections in Amdahl's law and its implications for multicore design , 2010, ISCA '10.

[80]  David H. Bailey,et al.  The NAS parallel benchmarks summary and preliminary results , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[81]  Brad Calder,et al.  Phase tracking and prediction , 2003, ISCA '03.

[82]  Gavin Brown,et al.  Toward a more accurate understanding of the limits of the TLS execution paradigm , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[83]  L. Dagum,et al.  OpenMP: an industry standard API for shared-memory programming , 1998 .

[84]  Onur Mutlu,et al.  Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.

[85]  Antonia Zhai,et al.  A scalable approach to thread-level speculation , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[86]  Anastasia Ailamaki,et al.  Tolerating Dependences Between Large Speculative Threads Via Sub-Threads , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[87]  Margaret Martonosi,et al.  Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors , 2009, ISCA '09.

[88]  Antonio González,et al.  Clustered speculative multithreaded processors , 1999, ICS '99.

[89]  V. Rich Personal communication , 1989, Nature.

[90]  Dean M. Tullsen,et al.  Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices , 2005, PLDI '05.

[91]  Antonia Zhai,et al.  Dynamic performance tuning for speculative threads , 2009, ISCA '09.

[92]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[93]  Babak Falsafi,et al.  Multiplex: unifying conventional and speculative thread-level parallelism on a chip multiprocessor , 2001, ICS '01.

[94]  Antonia Zhai,et al.  The STAMPede approach to thread-level speculation , 2005, TOCS.

[95]  Gurindar S. Sohi,et al.  Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[96]  David K. Lowenthal,et al.  Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster , 2006, PPoPP '06.

[97]  Sally A. McKee,et al.  Understanding PARSEC performance on contemporary CMPs , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[98]  MartinezAlejandro,et al.  Boosting single-thread performance in multi-core systems through fine-grain multi-threading , 2009 .

[99]  Josep Torrellas,et al.  Hardware and software support for speculative execution of sequential binaries on a chip-multiprocessor , 1998, ICS '98.

[100]  James R. Goodman,et al.  Transactional lock-free execution of lock-based programs , 2002, ASPLOS X.

[101]  Margaret Martonosi,et al.  Techniques for Multicore Thermal Management: Classification and New Exploration , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[102]  Manoj Franklin,et al.  A general compiler framework for speculative multithreading , 2002, SPAA '02.

[103]  Jose Renau,et al.  Energy-Efficient Thread-Level Speculation , 2006, IEEE Micro.

[104]  Shekhar Y. Borkar,et al.  Thousand Core ChipsA Technology Perspective , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[105]  Edward A. Lee The problem with threads , 2006, Computer.

[106]  Bradford Nichols,et al.  Pthreads programming , 1996 .

[107]  Murali Annavaram,et al.  Mitigating Amdahl's Law through EPI Throttling , 2005, ISCA 2005.