Dynamic Optimization Techniques

As emphasized throughout this book, a high level of adaptability is necessary to cope with the highly heterogeneous behavior of recent applications. At the same time, binary code compatibility is mandatory, so that the large body of existing software can be reused without modification. In this scenario, this chapter discusses dynamic optimization techniques: how they can be used to improve performance, how they maintain binary compatibility, and some case studies. The chapter starts by presenting binary translation. Its main concepts are clarified, as well as the main challenges that a binary translation mechanism must handle to work properly. That section ends with a detailed view of some examples of binary translation machines. Then reuse is discussed, covering several types: instruction reuse, value prediction, basic block reuse, trace reuse, and dynamic trace memoization. Furthermore, as discussed in Chap. 3, even though reconfigurable systems offer great potential in terms of performance and energy, they alone can neither deal with the highly heterogeneous behavior of recent applications nor maintain binary compatibility. Therefore, this chapter ends by presenting approaches that combine reconfigurable architectures with mechanisms that resemble the behavior of dynamic optimization techniques.
