Peruse and Profit: Estimating the Accelerability of Loops

Developers today can target a multitude of execution models, ranging from general-purpose processors to fixed-function hardware accelerators, with many variations in between. There is a growing demand to assess the potential benefit of porting or rewriting an application for a target architecture in order to fully exploit the performance and/or energy efficiency such targets offer. As a first step in this process, however, it is necessary to determine whether the application has characteristics suitable for acceleration. In this paper, we present Peruse, a tool that characterizes the loops in an application and helps the programmer understand their amenability to acceleration. We consider a diverse set of features ranging from loop characteristics (e.g., loop exit points) and operation mixes (e.g., control vs. data operations) to wider code-region characteristics (e.g., idempotency, vectorizability). Peruse is language-, architecture-, and input-independent and performs the characterization on the compiler's intermediate representation. Relying on static analyses makes Peruse scalable and enables it to analyze large applications to identify and extract interesting loops suitable for acceleration. We show analysis results for unmodified applications from the SPEC CPU benchmark suite, Polybench, and HPC workloads. For an end user, an estimate of the potential speedup from acceleration is even more desirable. We therefore use Peruse's workload-characterization results as features and develop a machine-learning-based model that predicts the potential speedup of a loop when offloaded to a fixed-function hardware accelerator. Using this model to predict the speedup of the loops selected by Peruse, we achieve an accuracy of 79%.
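To make the prediction stage concrete, the sketch below shows how static loop features of the kind Peruse extracts (exit-point count, control-operation fraction, vectorizability, idempotency) might feed an off-the-shelf classifier that labels loops as likely to benefit from offloading. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature encoding, the toy training data, and the choice of a decision tree are all hypothetical.

```python
# Minimal sketch (not the Peruse implementation): map static loop features to a
# coarse "benefits from acceleration" label. Feature names and data are
# hypothetical, chosen only to mirror the kinds of features named in the paper.
from sklearn.tree import DecisionTreeClassifier

# Each row: [num_exit_points, control_op_fraction, is_vectorizable, is_idempotent]
loop_features = [
    [1, 0.05, 1, 1],
    [1, 0.10, 1, 0],
    [3, 0.40, 0, 0],
    [2, 0.35, 0, 1],
    [1, 0.08, 1, 1],
    [4, 0.50, 0, 0],
]
# 1 = loop sped up when offloaded to the accelerator, 0 = it did not
speedup_labels = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(loop_features, speedup_labels)

# Classify a new loop described by its statically extracted features
new_loop = [[1, 0.12, 1, 0]]
print(model.predict(new_loop))  # e.g., [1] -> predicted to benefit from offloading
```

In practice, such a model would be trained on loops whose speedup on the target accelerator has been measured and then evaluated on held-out loops; the paper reports 79% accuracy for its own model and feature set.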
