Paper information - Exploring performance and energy tradeoffs for irregular applications: A case study on the Tilera many-core architecture

High-performance parallel applications with irregular data accesses are becoming a critical workload class for modern systems. In particular, the execution of such workloads on emerging many-core systems is expected to be a significant component of applications in data mining, machine learning, scientific computing and graph analytics. However, power and energy constraints limit the capabilities of the individual cores, memory hierarchy and on-chip interconnect of such systems, leading to architectural and software trade-offs that must be understood in the context of the intended applications' behavior. Irregular applications are notoriously hard to optimize given their data-dependent access patterns, lack of structured locality, and complex data structures and code patterns. We have ported two irregular applications, graph community detection using the Louvain method (Grappolo) and high-performance conjugate gradient (HPCCG), to the Tilera many-core system and have conducted a detailed study of platform-independent and platform-specific optimizations that improve their performance as well as reduce their overall energy consumption. To conduct this study, we employ an auto-tuning based approach that explores the optimization design space along three dimensions: memory layout schemes, GCC compiler flag choices and OpenMP loop scheduling options. We leverage MIT's OpenTuner auto-tuning framework to explore and recommend energy-optimal choices for different combinations of parameters. We then conduct an in-depth architectural characterization to understand the memory behavior of the selected workloads. Finally, we perform a correlation study to demonstrate the interplay between the hardware behavior and application characteristics. Using auto-tuning, we demonstrate whole-node energy savings and performance improvements of up to 49.6% and 60%, respectively, relative to a baseline instantiation, and up to 31% and 45.4%, respectively, relative to manually optimized variants.
Optimizing irregular applications for modern many-core architectures is challenging.
Study energy and performance efficiency on Tilera using two irregular applications.
Using auto-tuning, we explore the optimization design space along three dimensions.
Show whole-node energy savings and performance improvements of up to 49.6% and 60%.
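
The three-dimensional design space described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' setup: the layout names, the `measure` function and all of its numbers are hypothetical placeholders, and the search is a plain exhaustive enumeration rather than the search techniques OpenTuner applies. Only the GCC optimization levels and the OpenMP schedule kinds are real option names.

```python
import itertools

# Hypothetical option lists; this summary does not enumerate the real ones.
MEMORY_LAYOUTS = ["layout_a", "layout_b"]        # placeholder names (assumption)
GCC_FLAGS = ["-O1", "-O2", "-O3"]                # standard GCC optimization levels
OMP_SCHEDULES = ["static", "dynamic", "guided"]  # standard OpenMP schedule kinds


def measure(layout, flag, schedule):
    """Stand-in for building and running the application on hardware.

    Returns (runtime_seconds, average_power_watts). All numbers are
    synthetic and exist only to make the sketch runnable.
    """
    runtime = 10.0
    runtime *= {"layout_a": 1.0, "layout_b": 0.8}[layout]
    runtime *= {"-O1": 1.0, "-O2": 0.7, "-O3": 0.65}[flag]
    runtime *= {"static": 1.0, "dynamic": 0.9, "guided": 0.85}[schedule]
    # Assume the most aggressive flag draws slightly more power.
    power = 50.0 + (5.0 if flag == "-O3" else 0.0)
    return runtime, power


def energy_joules(runtime, power):
    """Energy is average power multiplied by runtime."""
    return runtime * power


def search():
    """Enumerate every configuration and keep the one with the lowest energy."""
    best = None
    for config in itertools.product(MEMORY_LAYOUTS, GCC_FLAGS, OMP_SCHEDULES):
        runtime, power = measure(*config)
        energy = energy_joules(runtime, power)
        if best is None or energy < best[1]:
            best = (config, energy, runtime)
    return best


if __name__ == "__main__":
    config, energy, runtime = search()
    print("energy-optimal configuration:", config)
    print("energy (J):", round(energy, 2), "runtime (s):", round(runtime, 2))
```

With these synthetic numbers the energy-optimal configuration uses `-O2` even though `-O3` runs faster, which illustrates why energy and runtime may need to be tuned as separate objectives.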
