A Novel DSP Architecture for Scientific Computing and Deep Learning

Exascale computing requires accelerators with ultrahigh power efficiency. Digital signal processors (DSPs), among the most important embedded processors and widely known for high power efficiency, have rarely been explored in the HPC community. We propose FT-Matrix2000, a 64-bit general-purpose DSP architecture that not only integrates the main features of DSPs but also introduces several novel enhancements for scientific computing. The FT-Matrix2000 architecture comprises multiple FT-Matrix2 cores and optional RISC CPU cores. The FT-Matrix2 core uses a VLIW+SIMD architecture, supports double-precision operations, and optimizes both the data path and the control path for scientific computing. Our evaluations show that FT-Matrix2000 achieves a performance of 1107 GFLOPS at an efficiency of 92.25%. Compared with the MIC and a 40 nm process GPU, FT-Matrix2000 improves GEMM power efficiency by factors of 1.49 and 2.68, respectively. We build a prototype supercomputer with FT-Matrix2000/12. Its HPL efficiency reaches 62.2%, and its performance-to-power ratio is 5.33 GFLOPS/W, which would rank fourth in the latest Green500 list. These results validate that the FT-Matrix2000 architecture is suitable for scientific computing while maintaining its signal-processing efficiency. Moreover, the enhancements of FT-Matrix2000 for vector and matrix computations also enable it to support deep learning applications efficiently. We have implemented several typical DCNN models on FT-Matrix2000, NVIDIA GPUs, and the Cadence Vision P6 DSP. The experiments demonstrate that the average computation efficiency of the proposed architecture based on Matrix2000 is about 20% to 35% higher than that of the GPUs and about 8% higher than that of the Cadence Vision P6 DSP.
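
As a rough reading aid for the performance figures above, the sketch below back-computes the peak throughput implied by the reported 1107 GFLOPS and 92.25% efficiency. It assumes that efficiency means achieved performance divided by theoretical peak, a definition the abstract does not state explicitly. The function name is hypothetical and chosen only for illustration.

```python
def implied_peak_gflops(achieved_gflops: float, efficiency: float) -> float:
    """Return the peak throughput implied by an achieved rate and an efficiency.

    Assumption (not stated in the abstract): efficiency = achieved / peak,
    with efficiency given as a fraction in (0, 1].
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be a fraction in (0, 1]")
    return achieved_gflops / efficiency


# Figures reported in the abstract: 1107 GFLOPS at 92.25% efficiency.
peak = implied_peak_gflops(1107.0, 0.9225)
print(f"implied peak: {peak:.1f} GFLOPS")
```

Under that assumption the implied peak is about 1200 GFLOPS. The same ratio cannot be applied to the HPL or Green500 figures, because the abstract does not give the absolute HPL throughput or power draw of the prototype.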
