Paper information - Liquid Silicon-Monona: A Reconfigurable Memory-Oriented Computing Fabric with Scalable Multi-Context Support

With the recent trend of promoting Field-Programmable Gate Arrays (FPGAs) to first-class citizens in accelerating compute-intensive applications in networking, cloud services and artificial intelligence, FPGAs face two major challenges in sustaining competitive advantages in performance and energy efficiency for diverse cloud workloads: 1) limited configuration capability for supporting light-weight computations and on-chip data storage to accelerate emerging search- and data-intensive applications; and 2) lack of architectural support to hide reconfiguration overhead for assisting virtualization in a cloud computing environment. In this paper, we propose a reconfigurable memory-oriented computing fabric, namely Liquid Silicon-Monona (L-Si), enabled by an emerging nonvolatile memory technology, i.e., RRAM, to address these two challenges. Specifically, L-Si addresses the first challenge by virtue of a new architecture comprising a 2D array of physically identical but functionally configurable building blocks. For the first time, it extends the configuration capabilities of existing FPGAs from computation alone to the whole spectrum ranging from computation to data storage. It allows users to better customize hardware by flexibly partitioning hardware resources between computation and memory, greatly benefiting emerging search- and data-intensive applications. To address the second challenge, L-Si provides scalable multi-context architectural support to minimize reconfiguration overhead for assisting virtualization. In addition, we provide compiler support to facilitate the programming of applications written in high-level programming languages (e.g., OpenCL) and frameworks (e.g., TensorFlow, MapReduce) while fully exploiting the unique architectural capability of L-Si. Our evaluation results show that L-Si achieves 99.6% area reduction, 1.43× throughput improvement and 94.0% power reduction on search-intensive benchmarks, as compared with the FPGA baseline. For neural network benchmarks, on average, L-Si achieves 52.3× speedup, 113.9× energy reduction and 81% area reduction over the FPGA baseline. In addition, the multi-context architecture of L-Si reduces the context switching time to ∼10 ns, compared with an off-the-shelf FPGA (∼100 ms), greatly facilitating virtualization.
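The abstract mixes percentage reductions with multiplicative factors, so it can help to put them on one scale. The short sketch below is only back-of-the-envelope arithmetic on the quoted figures, taking the L-Si context switching time as about 10 ns; the helper name `reduction_to_factor` and the variable names are illustrative and do not come from the paper.

```python
def reduction_to_factor(percent_reduction: float) -> float:
    """Convert an 'X% reduction' into an 'N times smaller' factor."""
    remaining_fraction = 1.0 - percent_reduction / 100.0
    return 1.0 / remaining_fraction


# Search-intensive benchmarks (figures quoted in the abstract).
search_area_factor = reduction_to_factor(99.6)   # roughly 250x smaller area
search_power_factor = reduction_to_factor(94.0)  # roughly 16.7x lower power

# Neural network benchmarks (average area reduction quoted in the abstract).
nn_area_factor = reduction_to_factor(81.0)       # roughly 5.3x smaller area

# Context switching: about 10 ns (L-Si) versus about 100 ms (off-the-shelf FPGA).
lsi_switch_seconds = 10e-9
fpga_switch_seconds = 100e-3
switch_ratio = fpga_switch_seconds / lsi_switch_seconds  # roughly 1e7

if __name__ == "__main__":
    print(f"search area factor:   {search_area_factor:.1f}x")
    print(f"search power factor:  {search_power_factor:.1f}x")
    print(f"NN area factor:       {nn_area_factor:.1f}x")
    print(f"context switch ratio: {switch_ratio:.1e}x")
```

Read this way, the context switching improvement is about seven orders of magnitude, while the area and power percentages correspond to factors of roughly 250×, 17× and 5×. Both switching times are approximate in the source, so these ratios are order-of-magnitude estimates only.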

Jing Li | Yue Zha
