In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

David A. Patterson | Norman P. Jouppi | Kyle Lucke | James Law | Jeffrey Dean | Doe Hyun Yoon | Eric Wilcox | Naveen Kumar | Dan Steinberg | Cliff Young | Horia Toma | James Laudon | Robert Hundt | Julian Ibarz | Thomas Norrie | Gregory Sizikov | Steve Lacy | Emad Samadiani | Richard Walter | Robert Hagmann | Mark Omernick | Gaurav Agrawal | Rick Boyle | Vijay Vasudevan | Bo Tian | Nishant Patil | Adriana Maggiore | Matt Ross | Daniel Killebrew | Andy Phelps | Alexander Kaplan | Kieran Miller | Ravi Narayanaswami | John Hu | Raminder Bajwa | Sarah Bates | Suresh Bhatia | Nan Boden | Al Borchers | Pierre-luc Cantin | Clifford Chao | Chris Clark | Jeremy Coriell | Mike Daley | Matt Dau | Ben Gelb | Tara Vazir Ghaemmaghami | Rajendra Gottipati | William Gulland | C. Richard Ho | Doug Hogberg | Dan Hurt | Aaron Jaffey | Alek Jaworski | Harshit Khaitan | Andy Koch | Diemthu Le | Chris Leary | Zhuyuan Liu | Alan Lundin | Gordon MacKean | Maire Mahony | Rahul Nagarajan | Ray Ni | Kathy Nix | Narayana Penukonda | Jonathan Ross | Amir Salek | Chris Severn | Matthew Snelham | Jed Souter | Andy Swing | Mercedes Tan | Gregory Thorson | Erick Tuttle | Walter Wang
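
A quick back-of-the-envelope check of the peak-throughput figure quoted in the abstract, written as a short Python sketch. The 256x256 organization of the MAC array and the 700 MHz clock rate come from the full paper rather than from this abstract, so treat them as stated assumptions here.

    # Sanity check of the quoted 92 TOPS peak: 65,536 8-bit MACs at an assumed 700 MHz clock.
    MATRIX_DIM = 256                    # MAC array is 256 x 256 = 65,536 units
    MACS = MATRIX_DIM * MATRIX_DIM
    OPS_PER_MAC = 2                     # one multiply plus one accumulate per cycle
    CLOCK_HZ = 700e6                    # assumed clock rate (from the full paper)

    peak_ops = MACS * OPS_PER_MAC * CLOCK_HZ
    print(f"Peak throughput: {peak_ops / 1e12:.1f} TeraOps/s")   # ~91.8, rounded to the quoted 92 TOPS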
