Determining Performance Boundaries and Automatic Loop Optimization of High-Level System Specifications

Designers face high time-to-market pressure and a growing demand for computational power. As a result, they must assess the quality of a specification for an intended technology as early as possible. The designer needs to know whether the specification can be improved, and at what cost. Specification trade-offs are often based on the designer's experience and intuition, which alone are not sufficient for making design decisions given the complexity of modern designs. Therefore, we need to identify the performance boundaries for the execution of a specification on an intended technology. The degree of parallelism, required resources, scheduling constraints, and possible optimizations are essential in determining design trade-offs (e.g., power consumption and execution time). However, existing tools lack both the capability to determine relevant performance parameters and the option to automatically optimize high-level specifications in order to make meaningful design trade-offs. To address these problems, we present in this thesis a new profiler tool, cprof. The tool uses the Clang compiler front-end to parse high-level specifications and to produce instrumented source code for profiling. From a high-level specification written in C or C++, it automatically determines the degree of parallelism of the source code. Furthermore, cprof estimates the number of clock cycles needed to complete a program, automatically applies loop optimization techniques, determines the lower and upper bounds on throughput capacity, and generates hardware execution traces. The tool assumes that the specification is executed on a parallel Model of Computation (MoC) referred to as a Polyhedral Process Network (PPN).
The proposed tool adds new functionality to existing technologies. For the PolyBench/C benchmarks, the performance estimated by cprof was almost identical to that of realistic implementations on Field-Programmable Gate Array (FPGA) platforms. Cprof can estimate the lower and upper bounds on throughput capacity, allowing the designer to make performance trade-offs based on real design points. As a result, cprof needs only the high-level specification to assist in Design Space Exploration (DSE) and to improve design quality.
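
To make the notions of degree of parallelism and performance bounds more concrete, the following is a minimal sketch of one common way to bracket the execution time of a dependence graph. It is not cprof's algorithm. The function name `parallelism_bounds`, the operation names, and the per-operation latencies are all hypothetical and chosen only for illustration. The sketch treats fully sequential execution (the sum of all latencies) as the slowest case, the critical path under unlimited resources as the fastest case, and their ratio as the average degree of parallelism.

```python
from functools import lru_cache


def parallelism_bounds(latency, deps):
    """Bracket the cycle count of a dependence graph (illustrative only).

    latency: dict mapping each operation name to its latency in cycles.
    deps:    dict mapping an operation to the operations it depends on.
    """

    @lru_cache(maxsize=None)
    def finish(op):
        # Earliest finish time of `op`, assuming unlimited resources:
        # it starts as soon as all of its predecessors have finished.
        preds = deps.get(op, ())
        return latency[op] + max((finish(p) for p in preds), default=0)

    work = sum(latency.values())              # one operation at a time
    span = max(finish(op) for op in latency)  # critical path length
    return {
        "sequential_cycles": work,
        "critical_path_cycles": span,
        "avg_parallelism": work / span,
    }


# Hypothetical example: four independent multiplies (3 cycles each)
# reduced by a two-level tree of adds (1 cycle each).
latency = {"m0": 3, "m1": 3, "m2": 3, "m3": 3, "a0": 1, "a1": 1, "a2": 1}
deps = {"a0": ["m0", "m1"], "a1": ["m2", "m3"], "a2": ["a0", "a1"]}
result = parallelism_bounds(latency, deps)
```

For this hypothetical graph the two cycle counts differ by a factor of three, which is the kind of gap a designer would weigh against the extra resources needed to run the multiplies in parallel.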
