Hybrid multi-core architecture for boosting single-threaded performance

Technology scaling and the diminishing returns of complex uniprocessors have driven the industry towards multi-core processors. While multithreaded applications can naturally leverage the enhanced throughput of multi-core processors, many important applications are single-threaded and cannot automatically harness that potential. In this paper, we propose a compiler-driven heterogeneous multi-core architecture, consisting of tightly integrated VLIW (Very Long Instruction Word) and superscalar processors on a single chip, to automatically boost the performance of single-threaded applications without compromising the capability to support multithreaded programs. In the proposed architecture, the high-performance VLIW core runs code segments with high instruction-level parallelism (ILP) extracted by the compiler, while the superscalar core handles runtime events that are typically difficult for the VLIW core to handle, such as L2 cache misses. Our initial experiments, which run a pre-execution thread on the superscalar core to mitigate the L2 cache misses of the main thread on the VLIW core, indicate that the proposed VLIW/superscalar multi-core processor can automatically improve the performance of single-threaded general-purpose applications by up to 40.8%.
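The pre-execution idea can be illustrated with a toy model. The sketch below is not the paper's simulator or compiler flow; it is a minimal illustration that assumes an LRU cache, fixed hit and miss latencies, and a helper that runs a fixed number of accesses ahead of the main thread and whose prefetches complete instantly. The names `LRUCache`, `run_main_thread`, `run_ahead`, `HIT_CYCLES`, and `MISS_CYCLES` are hypothetical.

```python
from collections import OrderedDict

HIT_CYCLES = 1     # assumed latency of a cache hit (illustrative value)
MISS_CYCLES = 100  # assumed latency of an L2 miss (illustrative value)


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict the LRU line
        return hit


def run_main_thread(addresses, capacity, run_ahead=0):
    """Return cycles spent by the main thread on its memory accesses.

    With run_ahead > 0, a helper (standing in for the pre-execution thread
    on the second core) touches the address the main thread will need
    run_ahead accesses later. Helper latency is ignored in this sketch.
    """
    cache = LRUCache(capacity)
    cycles = 0
    for i, addr in enumerate(addresses):
        if run_ahead and i + run_ahead < len(addresses):
            cache.access(addresses[i + run_ahead])  # helper warms the cache
        cycles += HIT_CYCLES if cache.access(addr) else MISS_CYCLES
    return cycles


if __name__ == "__main__":
    trace = list(range(1000))  # streaming trace: every address is new
    baseline = run_main_thread(trace, capacity=64)
    assisted = run_main_thread(trace, capacity=64, run_ahead=8)
    print(f"baseline cycles: {baseline}")
    print(f"with helper:     {assisted}")
```

In this idealized setting the main thread only misses on the first few accesses, before the helper has pulled ahead. A real helper thread competes for bandwidth, can run too early or too late, and must itself be constructed by the compiler, so the gap shown here is an upper bound on the effect rather than a prediction.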
