A flexible shared library profiler for early estimation of performance gains in heterogeneous systems

The effective acceleration of computationally demanding applications in heterogeneous systems often requires significant optimization effort. Although such a task typically starts with a thorough profiling stage, special attention must be given to the migration of each application kernel: apart from the actual computation time, the cost of data transfers between the main processor memory and the accelerator plays a significant role and often limits the resulting speedup. In some cases, no performance gain is achieved at all, because the communication-to-computation ratio is too high. To ease the system designer's effort, this paper proposes a framework that transparently collects extensive profile information, including, but not limited to, the values of the processor performance counters and an estimate of the amount of data to be transferred to and from the accelerator. The framework focuses on the transparent acceleration of kernels implemented as library functions and is based on the shared library interposing technique. By further processing the obtained execution profiles, together with suitable communication and computation models, the attainable global speedup of the accelerated application is predicted. The presented methods were validated experimentally on a set of existing applications. The measured error of the global speedup estimate typically ranged between 1% and 4%.
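To illustrate how a profile can be turned into a global speedup estimate, the following is a minimal sketch rather than the model used in the paper. It assumes a simple linear communication model (a fixed latency per transfer plus volume divided by bandwidth) and a single offloaded kernel. The names `KernelProfile`, `transfer_time_s`, and `estimated_global_speedup`, and all of their parameters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    """Hypothetical per-kernel profile record (field names are illustrative)."""
    cpu_time_s: float  # total time spent in the kernel on the host CPU
    calls: int         # number of kernel invocations observed
    bytes_in: int      # total bytes that would be sent to the accelerator
    bytes_out: int     # total bytes that would be read back from it


def transfer_time_s(profile, bandwidth_bytes_per_s, latency_s):
    """Assumed linear model: per-transfer latency plus volume / bandwidth."""
    volume = profile.bytes_in + profile.bytes_out
    # Assumption: one inbound and one outbound transfer per kernel call.
    return 2 * profile.calls * latency_s + volume / bandwidth_bytes_per_s


def estimated_global_speedup(total_time_s, profile, kernel_speedup,
                             bandwidth_bytes_per_s, latency_s):
    """Estimate the whole-application speedup when one kernel is offloaded."""
    # Accelerated kernel cost = reduced compute time + added transfer time.
    accelerated_kernel_s = (
        profile.cpu_time_s / kernel_speedup
        + transfer_time_s(profile, bandwidth_bytes_per_s, latency_s)
    )
    # The rest of the application is assumed to run unchanged on the host.
    new_total_s = total_time_s - profile.cpu_time_s + accelerated_kernel_s
    return total_time_s / new_total_s
```

Under these assumptions, a kernel that accounts for 8 s of a 10 s run and is accelerated tenfold yields a global speedup above 1 only if the link is fast enough. With a slow link, the transfer time can exceed the computation time saved, and the estimate drops below 1, which corresponds to the no-gain case described above.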
