In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

David A. Patterson | Norman P. Jouppi | Kyle Lucke | James Law | Jeffrey Dean | Doe Hyun Yoon | Eric Wilcox | Naveen Kumar | Dan Steinberg | Cliff Young | Horia Toma | James Laudon | Robert Hundt | Julian Ibarz | Thomas Norrie | Gregory Sizikov | Steve Lacy | Emad Samadiani | Richard Walter | Robert Hagmann | Mark Omernick | Gaurav Agrawal | Rick Boyle | Vijay Vasudevan | Bo Tian | Nishant Patil | Adriana Maggiore | Matt Ross | Daniel Killebrew | Andy Phelps | Alexander Kaplan | Kieran Miller | Ravi Narayanaswami | John Hu | Raminder Bajwa | Sarah Bates | Suresh Bhatia | Nan Boden | Al Borchers | Pierre-luc Cantin | Clifford Chao | Chris Clark | Jeremy Coriell | Mike Daley | Matt Dau | Ben Gelb | Tara Vazir Ghaemmaghami | Rajendra Gottipati | William Gulland | C. Richard Ho | Doug Hogberg | Dan Hurt | Aaron Jaffey | Alek Jaworski | Harshit Khaitan | Andy Koch | Diemthu Le | Chris Leary | Zhuyuan Liu | Alan Lundin | Gordon MacKean | Maire Mahony | Rahul Nagarajan | Ray Ni | Kathy Nix | Narayana Penukonda | Jonathan Ross | Amir Salek | Chris Severn | Matthew Snelham | Jed Souter | Andy Swing | Mercedes Tan | Gregory Thorson | Erick Tuttle | Walter Wang
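
A quick back-of-the-envelope check of the peak-throughput figure quoted in the abstract, written as a short Python sketch. The 256x256 organization of the MAC array and the 700 MHz clock rate come from the full paper rather than from this abstract, so treat them as stated assumptions here.

    # Sanity check of the quoted 92 TOPS peak: 65,536 8-bit MACs at an assumed 700 MHz clock.
    MATRIX_DIM = 256                    # MAC array is 256 x 256 = 65,536 units
    MACS = MATRIX_DIM * MATRIX_DIM
    OPS_PER_MAC = 2                     # one multiply plus one accumulate per cycle
    CLOCK_HZ = 700e6                    # assumed clock rate (from the full paper)

    peak_ops = MACS * OPS_PER_MAC * CLOCK_HZ
    print(f"Peak throughput: {peak_ops / 1e12:.1f} TeraOps/s")   # ~91.8, rounded to the quoted 92 TOPS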
