Trace-Based Dynamic Binary Parallelization

With the number of cores increasing rapidly while per-core performance increases slowly at best, software must be parallelized in order to improve performance. Manual parallelization is often prohibitively time-consuming and error-prone (especially because of data races and memory-consistency complexities), and some portions of code may simply be too difficult to understand or refactor for parallelization. Most existing automatic parallelization techniques are applied statically at compile time and require source code to analyze, leaving a large fraction of software behind. In many cases, some or all of the source code and development toolchain has been lost or, in the case of third-party software, was never available. Furthermore, modern applications are assembled and defined at run time, making use of shared libraries, virtual functions, plugins, dynamically generated code, and other dynamic mechanisms, as well as multiple languages. These aspects of separate compilation prevent the compiler from obtaining a holistic view of the program, leading to the risk of incompatible parallelization techniques, subtle data races, and resource over-subscription. These considerations motivate dynamic binary parallelization (DBP). This dissertation explores the novel idea of trace-based DBP, which provides a large instruction window without introducing spurious dependencies. We hypothesize that traces provide a generally good trade-off between code visibility and analysis accuracy for a wide variety of applications, and thereby achieve better parallel performance. Compared to the raw dynamic instruction stream (DIS), traces expose more distant parallelism opportunities because their average length is typically much larger than the size of the hardware instruction window. Compared to the complete control flow graph (CFG), traces contain only the control and data dependencies on the execution path that is actually taken.
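
The contrast between a trace and the complete CFG can be illustrated with a minimal sketch. The toy instruction format, block names, and register names below are hypothetical and are not taken from Tracy; the code only shows that read-after-write dependencies collected along one executed path form a subset of those a path-insensitive CFG analysis would have to assume.

```python
# Hypothetical instructions: (name, destination register, source registers).
# The blocks and registers are illustrative only, not taken from Tracy.
CFG = {
    "entry": {"insts": [("i0", "r1", []), ("i1", "r2", [])],
              "succs": ["hot", "cold"]},
    "hot":   {"insts": [("i2", "r3", ["r2"])], "succs": ["exit"]},
    "cold":  {"insts": [("i3", "r1", ["r2"])], "succs": ["exit"]},
    "exit":  {"insts": [("i4", "r4", ["r1"])], "succs": []},
}


def trace_dependencies(path):
    """Read-after-write dependencies along one executed path (a trace)."""
    last_writer = {}
    deps = set()
    for block in path:
        for name, dest, srcs in CFG[block]["insts"]:
            for reg in srcs:
                if reg in last_writer:
                    deps.add((last_writer[reg], name))
            last_writer[dest] = name
    return deps


def all_paths(block="entry"):
    """Enumerate every path through the (acyclic) toy CFG."""
    succs = CFG[block]["succs"]
    if not succs:
        return [[block]]
    return [[block] + rest for s in succs for rest in all_paths(s)]


def cfg_dependencies():
    """Conservative view: the union of dependencies over all paths."""
    deps = set()
    for path in all_paths():
        deps |= trace_dependencies(path)
    return deps


hot_trace = ["entry", "hot", "exit"]
on_trace = trace_dependencies(hot_trace)
spurious = cfg_dependencies() - on_trace
print(sorted(on_trace))   # dependencies on the path actually taken
print(sorted(spurious))   # extra dependencies seen only in the full CFG
```

In this toy example the hot trace carries only the dependencies `i1 -> i2` and `i0 -> i4`, while the CFG view additionally has to respect `i1 -> i3` and `i3 -> i4` from the path that was not executed.
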
More importantly, while DIS-based DBP typically exploits only fine-grained parallelism and CFG-based DBP typically exploits only coarse-grained parallelism, traces can serve as a unified representation of program execution that seamlessly incorporates the exploitation of both coarse- and fine-grained parallelism. We develop Tracy, an innovative DBP framework that monitors a program at run time and
