A data-centric profiler for parallel programs
[1] Barton P. Miller, et al. The Paradyn Parallel Performance Measurement Tool, 1995, Computer.
[2] Vivien Quéma, et al. MemProf: A Memory Profiler for NUMA Multicore Systems, 2012, USENIX Annual Technical Conference.
[3] Barton P. Miller, et al. Mapping performance data for high-level and data views of parallel program performance, 1996, ICS '96.
[4] John M. Mellor-Crummey, et al. Pinpointing data locality problems using data-centric analysis, 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[5] Kristof Beyls, et al. Refactoring for Data Locality, 2009, Computer.
[6] James C. Browne, et al. Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics, 2012, 21st International Conference on Parallel Architectures and Compilation Techniques (PACT 2012).
[7] Collin McCurdy, et al. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms, 2010, IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2010).
[8] Susan L. Graham, et al. Gprof: A call graph execution profiler, 1982, SIGPLAN '82.
[9] Lance M. Berc, et al. Continuous profiling: where have all the cycles gone?, 1997, ACM Trans. Comput. Syst.
[10] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.
[11] Nathan R. Tallent, et al. HPCToolkit: tools for performance analysis of optimized parallel programs, 2010, Concurr. Comput. Pract. Exp.
[12] Wentao Chang, et al. Sampling-based program locality approximation, 2008, ISMM '08.
[13] Nathan R. Tallent, et al. Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles, 2010, ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010).
[14] Milind Kulkarni, et al. Accelerating multicore reuse distance analysis with sampling and parallelization, 2010, 19th International Conference on Parallel Architectures and Compilation Techniques (PACT 2010).
[15] Nathan Froyd, et al. Low-overhead call path profiling of unmodified, optimized code, 2005, ICS '05.
[16] Brian J. N. Wylie, et al. Memory Profiling using Hardware Counters, 2003, ACM/IEEE SC 2003 Conference (SC'03).
[17] Margaret Martonosi, et al. MemSpy: analyzing memory system bottlenecks in programs, 1992, SIGMETRICS '92/PERFORMANCE '92.
[18] John M. Mellor-Crummey, et al. Pinpointing data locality bottlenecks with low overhead, 2013, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2013).
[19] David A. Wood, et al. Cache profiling and the SPEC benchmarks: a case study, 1994, Computer.
[20] Kristof Beyls, et al. Discovery of Locality-Improving Refactorings by Reuse Path Analysis, 2006, HPCC.
[21] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, IEEE International Symposium on Workload Characterization (IISWC 2009).
[22] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, 2006.
[23] Jeffrey Dean, et al. ProfileMe: hardware support for instruction-level profiling on out-of-order processors, 1997, Proceedings of the 30th Annual International Symposium on Microarchitecture.
[24] Balaram Sinharoy, et al. IBM POWER7 performance modeling, verification, and evaluation, 2011.
[25] Jeffrey K. Hollingsworth, et al. Data Centric Cache Measurement on the Intel Itanium 2 Processor, 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[26] Nathan R. Tallent, et al. Binary analysis for measurement and attribution of program performance, 2009, PLDI '09.