Architecting a workload-agnostic heterogeneous multi-core processor

Improving single-thread performance remains an important challenge. Heterogeneous multi-cores offer the potential to improve single-thread performance by providing several core types that together capture a wide range of application behaviors. Some prior approaches to choosing the constituent cores of a heterogeneous multi-core have focused on reducing power consumption by employing monotonic cores. Other approaches that aim to improve performance assume a priori knowledge of the workload. It is uncertain how such workload-specific approaches would perform if the workload changes in the future. This dissertation addresses the question of choosing the cores of a heterogeneous multi-core in a workload-agnostic manner: the process of selecting the constituent cores is completely independent of any benchmark suite. We present several approaches to choosing cores and show that the resulting multi-core delivers high performance for a large number of application phases. We classify applications into one of four categories of kernels: pointer-chasing, array manipulation, arbitrary serial, and arbitrary parallel. We systematically vary instruction-level parameters and evaluate the highest-performing heterogeneous multi-cores using synthetically generated kernels. Because the resulting multi-core is workload-agnostic, its performance on real application phases is almost as good as that of a customized heterogeneous multi-core. Moreover, we demonstrate potential pitfalls of customization by showing that multi-cores tuned to a subset of the actual workload may perform poorly on the entire workload. We use statistical tools such as classification trees to understand the relationships between instruction-level parameters and core suitability. These classification trees serve as a starting point for application steering mechanisms. We show that, on average, an application steering mechanism based on classification trees performs better than random steering.
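To make the idea of tree-based steering concrete, the following is a minimal sketch of how a classification tree over instruction-level parameters could map an application phase to a core type. Everything in it is an illustrative assumption rather than the dissertation's actual design: the parameter names (`load_fraction`, `dependent_load_fraction`, `ilp`), the thresholds, and the core type labels are hypothetical, and a real tree would be learned from simulation data rather than written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseProfile:
    """Hypothetical instruction-level parameters for one application phase."""
    load_fraction: float            # fraction of instructions that are loads
    dependent_load_fraction: float  # fraction of loads whose address depends on an earlier load
    ilp: float                      # rough count of independent instructions available per cycle


def steer(phase: PhaseProfile) -> str:
    """Walk a small hand-written classification tree and return a core label.

    Each `if` is one internal node of the tree; each `return` is a leaf.
    Thresholds and core labels are made up for illustration only.
    """
    if phase.dependent_load_fraction > 0.5:
        # Pointer-chasing-like behavior: issue width helps little.
        return "narrow_core_large_window"
    if phase.ilp > 4.0:
        # Plenty of independent work: a wide core can exploit it.
        return "wide_issue_core"
    if phase.load_fraction > 0.3:
        # Memory-heavy but not serialized through dependent loads.
        return "large_cache_core"
    return "balanced_core"


if __name__ == "__main__":
    phases = {
        "list_walk": PhaseProfile(0.40, 0.70, 1.2),
        "dense_loop": PhaseProfile(0.25, 0.05, 6.0),
        "table_scan": PhaseProfile(0.45, 0.10, 2.5),
    }
    for name, profile in phases.items():
        print(name, "->", steer(profile))
```

A steering mechanism of this shape would be compared against a baseline that assigns each phase to a core uniformly at random, which is the comparison the abstract reports.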
