Parabilis: Speeding up Single-Threaded Applications by Extracting Fine-Grained Threads for Multi-core Execution

The trend in architectural design has shifted toward building multicore chips from many simple cores rather than a single complex out-of-order (OOO) core, owing to the complexity and energy cost of OOO processors. Multicore chips outperform a single OOO core on parallel applications, but they cannot exploit the parallelism inherent in single-threaded applications. To this end, this paper presents a compiler optimization methodology, coupled with minimal hardware extensions, that extracts simple fine-grained threads from a single-threaded application for execution on multiple cores of a chip multiprocessor (CMP). These fine-grained threads are independent, eliminating the need for inter-core communication and its costly latencies. The approach, which we call Parabilis, scales to eight cores and requires no complex hardware additions to simple multicore systems. Our evaluation shows that Parabilis yields an average speedup of 1.51 on an 8-core CMP architecture.
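
As a rough illustration only (not the Parabilis compiler pass itself), the C++ sketch below shows the kind of decomposition the abstract alludes to: a single sequential loop body contains two independent dependence chains, which are hand-split here into two fine-grained threads that share no intermediate values and therefore need no inter-core communication. The chain functions and data sizes are hypothetical examples chosen for the sketch.

```cpp
// Illustrative sketch, not the paper's method: two independent dependence
// chains from one sequential computation are run as separate fine-grained
// threads with no shared intermediate state.
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Chain A: a running sum over the input (hypothetical example).
static long chain_a(const std::vector<int>& v) {
    long acc = 0;
    for (int x : v) acc += x;
    return acc;
}

// Chain B: an independent hash-like reduction over the same input.
static long chain_b(const std::vector<int>& v) {
    long acc = 1;
    for (int x : v) acc ^= x * 31L;
    return acc;
}

int main() {
    std::vector<int> data(1 << 20);
    std::iota(data.begin(), data.end(), 0);

    long sum = 0, hash = 0;
    // Each thread writes only its own result variable, so the cores never
    // exchange intermediate values during execution.
    std::thread t_a([&] { sum  = chain_a(data); });
    std::thread t_b([&] { hash = chain_b(data); });
    t_a.join();
    t_b.join();

    std::printf("sum = %ld, hash = %ld\n", sum, hash);
    return 0;
}
```

In Parabilis this kind of partitioning is done automatically by the compiler at a finer granularity; the sketch only conveys why independent threads avoid communication latency.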
