Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics
[1] James C. Browne, et al. Performance Optimization of Data Structures Using Memory Access Characterization, 2011, 2011 IEEE International Conference on Cluster Computing.
[2] Donald Yeung, et al. Studying multicore processor scaling via reuse distance analysis, 2013, ISCA.
[3] Chen Ding, et al. Program locality analysis using reuse distance, 2009, TOPL.
[4] Kristof Beyls, et al. Reuse Distance as a Metric for Cache Behavior, 2001.
[5] Philippe Olivier Alexandre Navaux, et al. Communication in Shared Memory: Concepts, Definitions, and Efficient Detection, 2016, 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP).
[6] Philippe Olivier Alexandre Navaux, et al. Locality and Balance for Communication-Aware Thread Mapping in Multicore Systems, 2015, Euro-Par.
[7] Xipeng Shen, et al. Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?, 2010, CC.
[8] Susan L. Graham, et al. Gprof: A call graph execution profiler, 1982, SIGPLAN '82.
[9] Simon W. Moore, et al. A communication characterisation of Splash-2 and Parsec, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[10] Ayal Zaks, et al. Parcae: a system for flexible parallel execution, 2012, PLDI.
[11] Anoop Gupta, et al. The SPLASH-2 programs: characterization and methodological considerations, 1995, ISCA.
[12] Sean Peisert, et al. Fingerprinting Communication and Computation on HPC Machines, 2010.
[13] Kai Li, et al. The PARSEC benchmark suite: Characterization and architectural implications, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[14] Matt Bishop, et al. Multiclass classification of distributed memory parallel computations, 2012, Pattern Recognit. Lett.
[15] Yutao Zhong, et al. Predicting whole-program locality through reuse distance analysis, 2003, PLDI.
[16] Derek L. Schuff, et al. Multicore-aware reuse distance analysis, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[17] Ahmad Faraj, et al. Communication Characteristics in the NAS Parallel Benchmarks, 2002, IASTED PDCS.
[18] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[19] Philippe Olivier Alexandre Navaux, et al. An Efficient Algorithm for Communication-Based Task Mapping, 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[20] Philippe Olivier Alexandre Navaux, et al. Locality vs. Balance: Exploring Data Mapping Policies on NUMA Systems, 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[21] Philippe Olivier Alexandre Navaux, et al. Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[22] David Eklov, et al. StatStack: Efficient modeling of LRU caches, 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[23] Milind Kulkarni, et al. Accelerating multicore reuse distance analysis with sampling and parallelization, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[24] Milind Kulkarni, et al. Towards architecture independent metrics for multicore performance analysis, 2011, PERV.
[25] I. Lee, et al. Characterizing communication patterns of NAS-MPI benchmark programs, 2009, IEEE Southeastcon 2009.
[26] Kristof Beyls, et al. Generating cache hints for improved program efficiency, 2005, J. Syst. Archit.
[27] Zhen Liu, et al. Lightweight monitoring of MPI programs in real time, 2005, Concurr. Comput. Pract. Exp.
[28] Philippe Olivier Alexandre Navaux, et al. Characterizing communication and page usage of parallel applications for thread and data mapping, 2015, Perform. Evaluation.
[29] Matt Bishop, et al. Network-theoretic classification of parallel computation patterns, 2011, Int. J. High Perform. Comput. Appl.
[30] Abdolreza Mirzaei, et al. Characterizing Loop-Level Communication Patterns in Shared Memory, 2015, 2015 44th International Conference on Parallel Processing.
[31] J. Shalf, et al. Understanding ultra-scale application communication requirements, 2005, Proceedings of the 2005 IEEE International Workload Characterization Symposium.
[32] Kai Li, et al. PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on Chip-Multiprocessors, 2008, 2008 IEEE International Symposium on Workload Characterization.
[33] Zhen Li, et al. An Efficient Data-Dependence Profiler for Sequential and Parallel Programs, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[34] P. Sadayappan, et al. PARDA: A Fast Parallel Reuse Distance Analysis Algorithm, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[35] Donald Yeung, et al. Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs, 2013.