Bahurupi: A polymorphic heterogeneous multi-core architecture

Computing systems have made an irreversible transition towards parallel architectures with the emergence of multi-cores. Moreover, power and thermal limits in embedded systems mandate the deployment of many simpler cores rather than a few complex cores on chip. Consumer electronic devices, on the other hand, need to support an ever-changing set of diverse applications with varying performance demands. While some applications can benefit from thread-level parallelism offered by multi-core solutions, there still exist a large number of applications with substantial amount of sequential code. The sequential programs suffer from limited exploitation of instruction-level parallelism in simple cores. We propose a reconfigurable multi-core architecture, called Bahurupi, that can successfully reconcile the conflicting demands of instruction-level and thread-level parallelism. Bahurupi can accelerate the performance of serial code by dynamically forming coalition of two or more simple cores to offer increased instruction-level parallelism. In particular, Bahurupi can efficiently merge 2-4 simple 2-way out-of-order cores to reach or even surpass the performance of more complex and power-hungry 4-way or 8-way out-of-order core. Compared to baseline 2-way core, quad-core Bahurupi achieves up to 5.61 speedup (average 4.08 speedup) for embedded workloads. On an average, quad-core Bahurupi achieves 17% performance improvement and 43% improvement in energy consumption compared to 8-way out-of-order baseline core on a diverse set of embedded benchmark applications.

[1]  Shunfei Chen,et al.  MARSS: A full system simulator for multicore x86 CPUs , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[2]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[3]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003, ISCA '03.

[4]  Craig B. Zilles,et al.  Fundamental performance constraints in horizontal fusion of in-order cores , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[5]  S. Winkel Optimal versus Heuristic Global Code Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[6]  Norman P. Jouppi,et al.  Cacti 3. 0: an integrated cache timing, power, and area model , 2001 .

[7]  Engin Ipek,et al.  Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.

[8]  Srinivas Devadas,et al.  Dynamic Cache Partitioning via Columnization , 2000, DAC 2000.

[9]  Ravi Rajwar,et al.  The impact of performance asymmetry in emerging multicore architectures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10]  Onur Mutlu,et al.  Accelerating critical section execution with asymmetric multi-core architectures , 2009, ASPLOS.

[11]  W. Campbell,et al.  THE UNIVERSITY OF TEXAS AT DALLAS , 2004 .

[12]  David W. Wall,et al.  Limits of instruction-level parallelism , 1991, ASPLOS IV.

[13]  Kevin Skadron,et al.  Federation: Repurposing scalar cores for out-of-order instruction issue , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[14]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[15]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  Rainer Leupers,et al.  Multiprocessor Systems on Chip , 2011 .

[17]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008, Computer.

[18]  Scott A. Mahlke,et al.  Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[19]  Eduard Ayguadé,et al.  Task Superscalar: An Out-of-Order Task Pipeline , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[20]  Eric Rotenberg,et al.  FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[21]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[22]  Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, Salt Lake City, UT, USA, March 1-5, 2014 , 2014, ASPLOS.

[23]  Gurindar S. Sohi,et al.  Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[24]  Anoop Gupta,et al.  Parallel computer architecture - a hardware / software approach , 1998 .

[25]  Mark D. Hill,et al.  Amdahl's Law in the Multicore Era , 2008 .

[26]  Wayne H. Wolf,et al.  Multiprocessor Systems-on-Chips , 2004, ISVLSI.

[27]  Norman P. Jouppi,et al.  Conjoined-Core Chip Multiprocessing , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).