Hardware accelerator design for data centers

As the volume of available data grows, scaling the computational power of traditional systems is becoming increasingly inefficient. To overcome this problem, customized application-specific accelerators are becoming integral parts of modern system-on-chip (SoC) architectures. In this paper, we summarize existing hardware accelerators for data centers and discuss techniques for implementing them and integrating them into existing SoCs.
