10x10: A Case Study in Highly-Programmable and Energy-Efficient Heterogeneous Federated Architecture

Customized architecture is widely recognized as an important approach for improving performance and energy efficiency. To balance generality against the benefits of customization, researchers have proposed federating heterogeneous micro-engines. Using the 10x10 architecture and an integrated image and vision benchmark as a case study, we explore the performance and energy benefits achievable. Results for current 32nm technology and DDR3 memory show 10x10 architecture benefits of 140x (performance) and 72x (energy) overall. Adding 3D-stacked DRAM increases the benefits to 171x (performance) and 100x (energy). Finally, for a future 7nm transistor process, benefits as large as 597x (performance) and 137x (energy) are observed.
