Performance and power simulation of a functional-unit-network processor with SimpleScalar and Wattch

Loop acceleration is one way to improve the performance of a single- or multiple-issue microprocessor core. A new EDGE-like processor architecture incorporates a loop accelerator directly into the out-of-order back end of the processor, forming a network of functional-unit nodes interconnected as an extended hypercube. In this work, we have simulated the full processor pipeline of our architecture in a high-level language. In particular, we have extended SimpleScalar, a well-known processor simulator, to include our multi-functional-unit back-end design and to support our special instructions for loop acceleration. In this design, the instructions that form a qualifying loop are scheduled and dispatched only once; they remain in the back end for all loop iterations and exchange values in a data-flow fashion. We have also used the Wattch power estimation tool, which has traditionally been coupled with SimpleScalar to estimate power consumption during simulation, to show that our design yields significant power savings. Because the loop instructions reside in the functional-unit nodes during loop execution, the entire front end of the pipeline is turned off for that period, and the register file and the instruction cache are kept in a low-power state. The experiments include simulating the execution of small loop-based benchmarks from the Livermore loops, as well as longer real-life code taken from open-source MPEG video compression software. All experiments exhibit the expected improvements in performance and power consumption, confirming earlier performance measurements on the HDL model of the back end.
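To make the power argument concrete, here is a minimal, hypothetical sketch of activity-based energy accounting in the spirit of what Wattch does (energy estimated as unit access counts multiplied by per-access energies). It compares a conventional pipeline, in which every dynamic instruction is fetched, decoded, dispatched and accesses the register file, with a loop-accelerated mode in which each static loop instruction is fetched and dispatched once and values then travel between functional-unit nodes. The per-event energies, the `Instr` representation, the toy loop body and the accounting simplifications are all assumptions made for illustration; they are not the parameters or the simulator code used in this work, and idle or leakage power of gated units is ignored.

```python
from dataclasses import dataclass

# Hypothetical per-event energies in arbitrary units. These are illustrative
# placeholders, NOT calibrated Wattch parameters.
ENERGY = {
    "fetch": 1.0,     # instruction-cache read + fetch logic, per instruction
    "decode": 0.6,    # decode/rename/dispatch, per instruction
    "rf_read": 0.4,   # one register-file read
    "rf_write": 0.4,  # one register-file write
    "fu_op": 1.2,     # one functional-unit operation
    "link": 0.1,      # one value transfer between FU nodes (assumed cheap)
}


@dataclass(frozen=True)
class Instr:
    dest: str
    op: str
    srcs: tuple[str, ...]


def live_ins(body):
    """Names read before being written in body order: loop invariants and
    the initial values of loop-carried variables."""
    written, live = set(), []
    for ins in body:
        for s in ins.srcs:
            if s not in written and s not in live:
                live.append(s)
        written.add(ins.dest)
    return live


def conventional_events(body, iterations):
    """Every dynamic instruction passes through the whole pipeline."""
    n = len(body)
    reads = sum(len(i.srcs) for i in body)
    return {
        "fetch": n * iterations,
        "decode": n * iterations,
        "rf_read": reads * iterations,
        "rf_write": n * iterations,
        "fu_op": n * iterations,
        "link": 0,
    }


def accelerated_events(body, iterations):
    """Instructions are fetched and dispatched once and stay in FU nodes.

    Simplifying assumptions of this sketch: every use of a value produced
    inside the body costs one network transfer per iteration, and every
    produced value is written back to the register file once at loop exit.
    """
    produced = {i.dest for i in body}
    transfers = sum(1 for i in body for s in i.srcs if s in produced)
    return {
        "fetch": len(body),
        "decode": len(body),
        "rf_read": len(live_ins(body)),
        "rf_write": len(produced),
        "fu_op": len(body) * iterations,
        "link": transfers * iterations,
    }


def energy(events):
    """Total energy = sum over event types of count * per-event energy."""
    return sum(ENERGY[kind] * count for kind, count in events.items())


# Hypothetical toy loop body: acc += a*x + b; x += step
BODY = [
    Instr("t1", "mul", ("a", "x")),
    Instr("t2", "add", ("t1", "b")),
    Instr("acc", "add", ("acc", "t2")),
    Instr("x", "add", ("x", "step")),
]

if __name__ == "__main__":
    conv = conventional_events(BODY, 100)
    accel = accelerated_events(BODY, 100)
    print("conventional:", conv)
    print("accelerated: ", accel)
    print(f"energy ratio (accelerated / conventional): "
          f"{energy(accel) / energy(conv):.3f}")
```

In this toy model the conventional energy scales with every dynamic instruction in the front end and register file, whereas in the accelerated mode only the functional-unit operations and node-to-node transfers scale with the iteration count. This mirrors, in a very simplified form, the effect of switching off the front end and keeping the register file and instruction cache at low power while a loop executes.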
