Gene-Patterns: Should Architecture be Customized for Each Application?

Architectural support is crucial for emerging applications to achieve high performance and high system efficiency. There is currently a trend toward designing accelerators for specific applications, which has sparked a debate over whether architecture should be customized for each application. In this study, we introduce what we refer to as Gene-Patterns: the base patterns underlying diverse applications. We present a Recursive Reduce methodology for identifying hotspots, and we provide a HOtspot Trace Suite (HOTS) for the research community. We first extract the hotspot patterns and then remove redundancy to obtain the base patterns. We find that although the number of applications is huge and ever-increasing, the number of base patterns is relatively small, owing to the similarity among the patterns of diverse applications. This similarity stems not only from the algorithms but also from the data structures. We build the Periodic Table of Memory Access Patterns (PT-MAP), in which indifference curves are analogous to energy levels in physics and memory performance optimization is essentially an energy-level transition. We find that inefficiency results from the mismatch between some of the base patterns and the micro-architecture of modern processors, and we identify the key micro-architectural demands of the base patterns. The Gene-Pattern concept, methodology, and toolkit will facilitate the design of both hardware and software for matching architectures to applications.
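To make the two steps of the methodology concrete, here is a minimal sketch in Python: a recursive-reduce search that descends into the heaviest regions of a profile until it reaches hotspots, followed by a greedy redundancy-removal pass that keeps only patterns sufficiently different from those already kept. Everything here is an illustrative assumption, not the paper's actual implementation: the `Region` tree, the `find_hotspots` and `base_patterns` helpers, the 10% time-share threshold, and the feature-vector distance tolerance are all invented for this example.

```python
# Illustrative sketch only. "Region", "find_hotspots", "base_patterns",
# the 10% time-share threshold, and the distance tolerance are all
# assumptions for this example, not the paper's actual tooling.

import math
from dataclasses import dataclass, field
from typing import List, Sequence

@dataclass
class Region:
    """A node in a profiled call tree: a code region and its total time."""
    name: str
    time: float
    children: List["Region"] = field(default_factory=list)

def find_hotspots(region: Region, total: float,
                  threshold: float = 0.10) -> List[Region]:
    """Recursive-reduce search: descend only into children whose share of
    total execution time exceeds `threshold`; a significant region with no
    significant children is reported as a hotspot."""
    hot_children = [c for c in region.children if c.time / total >= threshold]
    if not hot_children:
        return [region]
    hotspots: List[Region] = []
    for child in hot_children:
        hotspots.extend(find_hotspots(child, total, threshold))
    return hotspots

def base_patterns(patterns: Sequence[Sequence[float]],
                  tol: float = 0.2) -> List[Sequence[float]]:
    """Greedy redundancy removal: keep a pattern (here, a normalized feature
    vector such as a reuse-distance or stride histogram) only if it is
    farther than `tol` from every base pattern already kept."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    bases: List[Sequence[float]] = []
    for p in patterns:
        if all(dist(p, b) > tol for b in bases):
            bases.append(p)
    return bases

if __name__ == "__main__":
    # A toy profile: 60% of total time in "spmv", 20% in "io".
    prof = Region("main", 100.0, [
        Region("solve", 70.0, [Region("spmv", 60.0), Region("norm", 5.0)]),
        Region("io", 20.0),
    ])
    print([h.name for h in find_hotspots(prof, prof.time)])  # ['spmv', 'io']
```

In this toy run the recursion stops at "spmv" and "io" because neither has a child above the threshold; the feature vectors kept by base_patterns would then form the rows of a PT-MAP-style catalogue of base patterns.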
