TotalProf: a fast and accurate retargetable source code profiler

Profilers play an important role in software/hardware design, optimization, and verification, and various approaches have been proposed to implement them. The most widespread approach in the embedded domain is Instruction Set Simulation (ISS) based profiling, which provides uncompromised accuracy but limited execution speed. Source code profilers, by contrast, are fast but less accurate. This paper introduces TotalProf, a fast and accurate source code cross profiler that estimates the performance of an application in three ways. First, code optimization and a novel virtual compiler backend are employed to mimic the course of target compilation. Second, an optimistic static scheduler is introduced to estimate the behavior of the target processor's datapath. Third, dynamic events, such as cache misses, bus contention, and branch prediction failures, are simulated at runtime. With an abstract architecture description, the tool can easily be retargeted in a performance-characteristics-oriented way to estimate different processor architectures, including DSPs and VLIW machines. Multiple instances of TotalProf can be integrated with SystemC to support heterogeneous Multi-Processor System-on-Chip (MPSoC) profiling. TotalProf achieves an execution speed of more than one Giga-Instruction-Per-Second (GIPS) while introducing an error of only about 5 to 15% in the major performance metrics, such as cycle count, memory accesses, and cache misses.
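The combination of static per-block cost estimates with runtime simulation of dynamic events can be illustrated with a small sketch. This is not TotalProf's implementation: the class names, the direct-mapped cache geometry, the miss penalty, and the per-block cycle costs are all hypothetical values chosen for illustration. The idea shown is only that instrumented source code charges a precomputed cycle cost for each basic block it executes, while a cache model consulted at runtime adds penalties for misses.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """Toy direct-mapped cache; the geometry is an illustrative assumption."""
    num_lines: int = 64
    line_bytes: int = 32
    tags: dict = field(default_factory=dict)
    accesses: int = 0
    misses: int = 0

    def access(self, address: int) -> bool:
        """Record one access and return True on a hit."""
        block = address // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        self.accesses += 1
        if self.tags.get(index) == tag:
            return True
        self.tags[index] = tag  # fill the line on a miss
        self.misses += 1
        return False


@dataclass
class Profile:
    """Accumulates static block costs plus dynamic miss penalties."""
    cache: DirectMappedCache
    miss_penalty: int = 20  # assumed cycles per cache miss
    cycles: int = 0

    def block(self, static_cycles: int) -> None:
        """Charge the statically estimated cost of one basic block."""
        self.cycles += static_cycles

    def load(self, address: int) -> None:
        """Simulate a memory access; a miss adds a dynamic penalty."""
        if not self.cache.access(address):
            self.cycles += self.miss_penalty


def profiled_sum(n: int, base: int = 0x1000) -> Profile:
    """Instrumented stand-in for a loop that reads n 4-byte words."""
    prof = Profile(DirectMappedCache())
    prof.block(2)  # loop setup, hypothetical static cost
    for i in range(n):
        prof.block(3)  # loop body, hypothetical static cost
        prof.load(base + 4 * i)
    return prof


if __name__ == "__main__":
    p = profiled_sum(16)
    print(p.cycles, p.cache.accesses, p.cache.misses)
```

In this sketch the static costs stand in for what a scheduler would compute ahead of time for each basic block, so the runtime work is limited to additions and cache lookups. That division of labor is one plausible reason a source-level approach can run much faster than instruction-by-instruction simulation.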
