Exploring performance and energy tradeoffs for irregular applications: A case study on the Tilera many-core architecture

High-performance parallel applications with irregular data accesses are becoming a critical workload class for modern systems. In particular, the execution of such workloads on emerging many-core systems is expected to be a significant component of applications in data mining, machine learning, scientific computing and graph analytics. However, power and energy constraints limit the capabilities of the individual cores, memory hierarchy and on-chip interconnect of such systems, leading to architectural and software trade-offs that must be understood in the context of the intended applications' behavior. Irregular applications are notoriously hard to optimize given their data-dependent access patterns, lack of structured locality, and complex data structures and code patterns. We have ported two irregular applications, graph community detection using the Louvain method (Grappolo) and high-performance conjugate gradient (HPCCG), to the Tilera many-core system and have conducted a detailed study of platform-independent and platform-specific optimizations that improve their performance and reduce their overall energy consumption. To conduct this study, we employ an auto-tuning based approach that explores the optimization design space along three dimensions: memory layout schemes, GCC compiler flag choices and OpenMP loop scheduling options. We leverage MIT's OpenTuner auto-tuning framework to explore and recommend energy-optimal choices for different combinations of parameters. We then conduct an in-depth architectural characterization to understand the memory behavior of the selected workloads. Finally, we perform a correlation study to demonstrate the interplay between hardware behavior and application characteristics. Using auto-tuning, we demonstrate whole-node energy savings and performance improvements of up to 49.6% and 60%, respectively, relative to a baseline instantiation, and up to 31% and 45.4%, respectively, relative to manually optimized variants.
Optimizing irregular applications for modern many-core architectures is challenging.
Study energy and performance efficiency on Tilera using two irregular applications.
Using auto-tuning, we explore the optimization design space along three dimensions.
Show whole-node energy savings and performance improvements of up to 49.6% and 60%.
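
The three-dimensional design space described above can be illustrated with a minimal sketch. This is not the authors' tuning setup and does not use OpenTuner: it swaps in a plain exhaustive search over a tiny space, and the layout labels, the candidate flag sets, and the `measure_energy` function with its numbers are all hypothetical placeholders. In a real study, `measure_energy` would build the application, run it, and read a power measurement.

```python
from itertools import product

# Hypothetical design space along the three dimensions named in the abstract.
# The layout labels are placeholders, not actual Tilera memory schemes.
MEMORY_LAYOUTS = ["layout_a", "layout_b"]
GCC_FLAGS = ["-O2", "-O3", "-O3 -funroll-loops"]
OMP_SCHEDULES = ["static", "dynamic", "guided"]


def measure_energy(layout, flags, schedule):
    """Synthetic stand-in for a real energy measurement (arbitrary units).

    The numbers below are invented for illustration only.
    """
    cost = 100.0
    cost -= {"layout_a": 0.0, "layout_b": 12.0}[layout]
    cost -= {"-O2": 0.0, "-O3": 8.0, "-O3 -funroll-loops": 5.0}[flags]
    cost -= {"static": 0.0, "dynamic": 10.0, "guided": 6.0}[schedule]
    return cost


def search(measure):
    """Exhaustively evaluate every configuration and return the cheapest."""
    space = list(product(MEMORY_LAYOUTS, GCC_FLAGS, OMP_SCHEDULES))
    best = min(space, key=lambda cfg: measure(*cfg))
    return best, measure(*best), len(space)


def relative_saving(baseline, tuned):
    """Percentage saving of the tuned value relative to the baseline."""
    return 100.0 * (baseline - tuned) / baseline


best_cfg, best_energy, space_size = search(measure_energy)
baseline_energy = measure_energy("layout_a", "-O2", "static")
print(best_cfg, best_energy, space_size)
print(relative_saving(baseline_energy, best_energy))
```

Exhaustive enumeration is only practical because this toy space has 18 points. With many more flag combinations the space grows multiplicatively, which is one reason a search-based auto-tuner is attractive.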
