Real Time Cache Performance Analyzing for Multi-core Parallel Programs

Modern processors rely on caches to hide memory access latency, so cache performance is critical to application performance. A detailed cache performance analysis gives programmers a clear view of their program's behavior, helping them identify performance bottlenecks and optimize the source code. As the chip industry has turned to integrating multiple cores on one chip, multi-core and many-core processors have become the new approach to sustaining Moore's Law. Parallel programs have therefore become more important, even on personal computers. In parallel programs, the interaction between tasks is a source of bugs and errors and is hard for most programmers to handle. Detailed cache behavior information can greatly help programmers find errors and optimize their programs. However, existing cache performance analysis tools depend on hardware performance counters, whose limitations prevent them from collecting as much data as needed. With only a limited set of cache-miss metrics, those tools cannot reveal how program routines behave on a shared cache or where cache misses originate. In this paper, we propose a method to obtain and analyze real-time cache performance using binary instrumentation and cache emulation. We instrument the parallel program while it is running and collect trace data on its memory accesses. We then feed the trace data to a carefully configured cache emulation module to obtain detailed cache behavior information. The emulation module not only gathers more information than hardware performance counters but can also be configured to simulate different target hardware environments. Additionally, we use the performance data to form a group of cache performance metrics that intuitively help programmers optimize their code. The accuracy of this method is demonstrated by comparing its summary results with those of the hardware performance counters.
Finally, we design a cache performance analysis tool for parallel programs named CC-Analyzer. Compared with existing technologies, CC-Analyzer can analyze the causes of cache misses and gather many more performance statistics while the parallel program runs on different cache architectures.
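
To make the trace-driven emulation idea concrete, the following is a minimal Python sketch, not the CC-Analyzer implementation. It assumes a hypothetical trace format of (thread id, routine name, address) records, a single shared set-associative cache with LRU replacement, and the conventional cold/capacity/conflict classification of misses; the abstract does not state which replacement policy or miss taxonomy the tool actually uses. The class name `SetAssociativeCache` and its parameters are illustrative only.

```python
from collections import OrderedDict, defaultdict


class SetAssociativeCache:
    """Trace-driven emulation of one set-associative cache (LRU assumed)."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.total_lines = self.num_sets * ways
        # One LRU-ordered dict of resident line numbers per set.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        # Shadow fully associative LRU cache of equal capacity, used only
        # to tell capacity misses apart from conflict misses.
        self.shadow = OrderedDict()
        # Every line touched so far, used to detect cold misses.
        self.seen = set()
        # Per-routine counters: routine -> outcome -> count.
        self.stats = defaultdict(lambda: defaultdict(int))

    def access(self, addr, routine="?"):
        """Emulate one memory access and return its outcome as a string."""
        line = addr // self.line_bytes
        cache_set = self.sets[line % self.num_sets]

        hit = line in cache_set
        if hit:
            cache_set.move_to_end(line)          # mark as most recently used
        else:
            if len(cache_set) >= self.ways:
                cache_set.popitem(last=False)    # evict least recently used
            cache_set[line] = True

        shadow_hit = line in self.shadow
        if shadow_hit:
            self.shadow.move_to_end(line)
        else:
            if len(self.shadow) >= self.total_lines:
                self.shadow.popitem(last=False)
            self.shadow[line] = True

        if hit:
            kind = "hit"
        elif line not in self.seen:
            kind = "cold"        # first reference to this line
        elif shadow_hit:
            kind = "conflict"    # a fully associative cache would have hit
        else:
            kind = "capacity"    # even full associativity would have missed
        self.seen.add(line)
        self.stats[routine][kind] += 1
        return kind


# Hypothetical trace records: (thread_id, routine, address). In a real
# setup these would be produced by instrumenting the running program.
trace = [
    (0, "producer", 0x1000),
    (1, "consumer", 0x1000),
    (0, "producer", 0x2000),
    (1, "consumer", 0x1000),
]

# Both threads feed the same emulated cache, modelling a shared level.
shared_cache = SetAssociativeCache(size_bytes=512, line_bytes=64, ways=2)
for _tid, routine, addr in trace:
    shared_cache.access(addr, routine)

for routine, counters in shared_cache.stats.items():
    print(routine, dict(counters))
```

Because the cache geometry is passed in as constructor arguments, the same trace can be replayed against different sizes, line lengths, and associativities, which is the kind of reconfiguration a hardware performance counter cannot offer. Keying the counters by routine is one simple way to attribute misses to program routines on a shared cache.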
