Top-Down Characterization Approximation based on performance counters architecture for AMD processors

Abstract Due to the increasing complexity of processors, developers often look for tools that simplify finding bottlenecks in running applications. Although more and more data can be collected from processors, interpreting it usually requires detailed knowledge of the internals of a given architecture. This paper introduces a Top-Down Characterization Approximation for analyzing the performance of applications executed on AMD processors; it extends the Top-Down Method initially developed by Intel. Because not all of the performance counters required to calculate the exact metric values are available on AMD processors, the method is called an approximation. The Top-Down Method gives a deeper understanding of the different stages of program execution, makes it possible to compare architectures, and identifies bottlenecks in out-of-order processors. It hides the complexity of microarchitectural details from the user while exposing the main contributors to inefficient program execution. It does so by defining a few main metrics on top of performance counters so that the main efficiency issues are easy to locate. Until now, the method had been applied to Intel processors only, mainly because it relies on designated performance counters that are specific to each processor, so porting it is not straightforward. Positive feedback from users encouraged the authors to develop a similar technique for AMD processors.
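To illustrate what "a few main metrics on top of performance counters" can look like, the following minimal sketch computes the four top-level categories of the original Intel Top-Down Method (Frontend Bound, Bad Speculation, Retiring, Backend Bound) from generic slot counts. The formulas follow the published Intel level-one breakdown. The abstract does not say which AMD counters approximate each input, so the field names, the default pipeline width of 4, and the sample numbers are placeholders rather than the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class SlotCounts:
    """Generic counter readings; field names are hypothetical, not vendor event names."""
    cycles: int              # unhalted core clock cycles
    uops_not_delivered: int  # issue slots the frontend left empty
    uops_issued: int         # micro-ops issued to the backend
    uops_retired: int        # micro-ops that retired (useful work)
    recovery_cycles: int     # cycles spent recovering from mis-speculation


def top_level_breakdown(c: SlotCounts, width: int = 4) -> dict:
    """Split all pipeline slots into the four level-one Top-Down categories.

    `width` is the assumed number of issue slots per cycle.
    """
    slots = width * c.cycles
    if slots <= 0:
        raise ValueError("cycles and width must be positive")

    frontend_bound = c.uops_not_delivered / slots
    # Issued-but-never-retired micro-ops plus the slots lost during recovery.
    bad_speculation = (c.uops_issued - c.uops_retired
                       + width * c.recovery_cycles) / slots
    retiring = c.uops_retired / slots
    # Whatever remains is attributed to backend stalls.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)

    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


if __name__ == "__main__":
    # Made-up sample readings, for illustration only.
    sample = SlotCounts(cycles=1000, uops_not_delivered=800,
                        uops_issued=2200, uops_retired=2000,
                        recovery_cycles=50)
    for name, share in top_level_breakdown(sample).items():
        print(f"{name}: {share:.1%}")
```

Because Backend Bound is derived as the remainder, the four shares always sum to one. An approximation for a processor that lacks one of the input counters would have to substitute an estimate for that input.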
