SD3: An Efficient Dynamic Data-Dependence Profiling Mechanism

As multicore processors are deployed in mainstream computing, the need for software tools that help parallelize programs is increasing dramatically. Data-dependence profiling is an important program analysis technique for exploiting parallelism in serial programs. More specifically, manual, semiautomatic, or automatic parallelization can use the results of data-dependence profiling to decide where and how to parallelize a program. However, state-of-the-art data-dependence profiling techniques consume excessive resources because they suffer from two major problems when profiling large, long-running applications: 1) runtime overhead and 2) memory overhead. Existing data-dependence profilers either cannot profile large-scale applications within a typical resource budget or report only very limited information. In this paper, we propose an efficient approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in the compressed format. We demonstrate that SD3 reduces the runtime overhead when profiling SPEC 2006 by factors of 4.1× and 9.7× on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile 22 SPEC 2006 benchmarks with the reference input, whereas previous approaches fail even with the train input. In some cases, we observe more than a 20× improvement in memory consumption and a 16× speedup in profiling time when 32 cores are used. We also demonstrate the usefulness of SD3 by showing manual parallelization guided by data-dependence profiling results.
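To make the memory-compression idea concrete, the following Python sketch collapses an address trace into stride records and then tests two records for a shared address without expanding them. It is an illustration under stated assumptions, not the SD3 implementation: the names `Stride`, `compress`, and `overlaps` are hypothetical, every access is treated as touching a single address (access sizes are ignored), and the overlap test is a plain GCD filter followed by an exact check over the shorter record.

```python
from math import gcd
from typing import List, NamedTuple


class Stride(NamedTuple):
    """Hypothetical compressed record: base, base+step, ..., count addresses."""
    base: int
    step: int
    count: int

    def last(self) -> int:
        return self.base + self.step * (self.count - 1)

    def contains(self, addr: int) -> bool:
        if self.step == 0:
            return addr == self.base
        off = addr - self.base
        return off % self.step == 0 and 0 <= off // self.step < self.count


def compress(addrs: List[int]) -> List[Stride]:
    """Greedily fold consecutive addresses with a constant difference."""
    runs: List[Stride] = []
    i, n = 0, len(addrs)
    while i < n:
        if i + 1 == n:
            # A trailing lone address becomes a one-element record.
            runs.append(Stride(addrs[i], 0, 1))
            break
        step = addrs[i + 1] - addrs[i]
        j = i + 1
        while j + 1 < n and addrs[j + 1] - addrs[j] == step:
            j += 1
        runs.append(Stride(addrs[i], step, j - i + 1))
        i = j + 1
    return runs


def overlaps(a: Stride, b: Stride) -> bool:
    """Return True if the two records share at least one address."""
    # Cheap rejection 1: the address ranges do not intersect.
    a_lo, a_hi = min(a.base, a.last()), max(a.base, a.last())
    b_lo, b_hi = min(b.base, b.last()), max(b.base, b.last())
    if max(a_lo, b_lo) > min(a_hi, b_hi):
        return False
    # Cheap rejection 2: GCD test on the two steps.
    g = gcd(a.step, b.step)
    if g == 0:  # both steps are zero, so each record is a single address
        return a.base == b.base
    if (b.base - a.base) % g != 0:
        return False
    # Exact check: walk the shorter record, probe the longer one in O(1).
    short, long_ = (a, b) if a.count <= b.count else (b, a)
    return any(long_.contains(short.base + k * short.step)
               for k in range(short.count))


writes = compress([0, 8, 16, 24])      # one record: base 0, step 8, count 4
reads_a = compress([4, 12, 20])        # interleaved with the writes
reads_b = compress([16, 20, 24])       # touches written addresses 16 and 24
conflict_a = overlaps(writes[0], reads_a[0])
conflict_b = overlaps(writes[0], reads_b[0])
```

In this example the trace of four writes is stored as a single record, and the dependence check between the writes and each read record runs on the compressed form. The interleaved reads are rejected by the GCD filter alone, while the second read record passes the filter and is confirmed by the exact check.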
