On the design of scalable and reusable accelerators for big data applications

Accelerators are becoming key elements of computing platforms in both data centers and mobile devices, as they deliver energy-efficient high performance for critical computational kernels. However, the design and integration of such components is complex, especially for Big Data applications, where accelerators must process very large workloads. Properly customizing the accelerators' private local memories (PLMs) is therefore of critical importance. To analyze this problem, we design an accelerator for Collaborative Filtering by applying a system-level design methodology that allows us to synthesize many alternative micro-architectures as we vary the PLM sizes. We then evaluate the resulting accelerators in terms of resource requirements for both embedded architectures and data centers as we vary the size and density of the workloads.
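
For context, the kernel below is a minimal sketch of the kind of computation a Collaborative Filtering accelerator targets; it is not taken from the paper, whose exact formulation is not given in the abstract. It assumes a simple neighborhood-based (user-user) scheme in C, and all names and tile sizes (ratings, sim, N_USERS, N_ITEMS, cosine, predict, build_similarity) are hypothetical. The statically sized arrays stand in for the private local memories (PLMs) whose dimensions a system-level design-space exploration would sweep.

/* Minimal sketch (assumed formulation, not the paper's kernel):
 * neighborhood-based collaborative filtering over a dense rating tile.
 * The fixed-size buffers model PLMs; missing ratings are encoded as 0.0f. */
#include <math.h>

#define N_USERS 64   /* hypothetical PLM tile: users per batch */
#define N_ITEMS 128  /* hypothetical PLM tile: items per batch */

static float ratings[N_USERS][N_ITEMS]; /* PLM #1: rating tile            */
static float sim[N_USERS][N_USERS];     /* PLM #2: user-user similarities */

/* Cosine similarity between two users over the item tile. */
static float cosine(int u, int v) {
    float dot = 0.0f, nu = 0.0f, nv = 0.0f;
    for (int i = 0; i < N_ITEMS; ++i) {
        dot += ratings[u][i] * ratings[v][i];
        nu  += ratings[u][i] * ratings[u][i];
        nv  += ratings[v][i] * ratings[v][i];
    }
    return (nu > 0.0f && nv > 0.0f) ? dot / (sqrtf(nu) * sqrtf(nv)) : 0.0f;
}

/* Fill the similarity PLM for the current tile. */
void build_similarity(void) {
    for (int u = 0; u < N_USERS; ++u)
        for (int v = 0; v < N_USERS; ++v)
            sim[u][v] = cosine(u, v);
}

/* Predict user u's rating for item i as a similarity-weighted average
 * over the other users in the tile that have rated item i. */
float predict(int u, int i) {
    float num = 0.0f, den = 0.0f;
    for (int v = 0; v < N_USERS; ++v) {
        if (v == u || ratings[v][i] == 0.0f) continue; /* skip self and missing */
        num += sim[u][v] * ratings[v][i];
        den += fabsf(sim[u][v]);
    }
    return (den > 0.0f) ? num / den : 0.0f;
}

In an HLS flow, the sizes of ratings and sim would be the PLM parameters varied to generate the alternative micro-architectures described above.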
