Paper information - Racetrack memory based logic design for in-memory computing

Racetrack memory based logic design for in-memory computing

In-memory computing has proven to be an efficient computing infrastructure in the big data era for applications such as graph processing and encryption. The area and power overhead of CMOS-based memory design is growing rapidly as data capacity and leakage power increase with shrinking technology nodes. An emerging memory technology, racetrack memory, has therefore been proposed to increase the data capacity and power efficiency of modern memory systems. Because the design requirements of conventional logic differ from those of emerging-memory-based logic for in-memory computing, the well-developed conventional CMOS logic designs are less relevant to in-memory computing built on emerging memories, and novel logic designs for racetrack memory are required. Traditional logic design on separate chips focuses on high speed, which incurs large area and power consumption. Implementing efficient logic for in-memory computing is challenging because of its demanding area and power requirements. First, because the computing logic is built into the memory, the available area budget is limited; otherwise the data density of the memory system would suffer. Second, the thermal constraint of the memory chip limits the energy budget available for computing logic. High energy consumption can raise the temperature enough to cause malfunction or even permanent damage to the memory chip. Finally, adopting emerging memory technologies makes logic design more challenging because of their unique characteristics, such as the sequential access mechanism of racetrack memory.

Tao Luo

[1] Beng Chin Ooi,et al. In-Memory Big Data Management and Processing: A Survey , 2015, IEEE Transactions on Knowledge and Data Engineering.

[2] Thomas Blum,et al. Montgomery modular exponentiation on reconfigurable hardware , 1999, Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336).

[3] William J. Dally,et al. Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[4] Akashi Satoh,et al. Systematic Design of RSA Processors Based on High-Radix Montgomery Multipliers , 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[5] Jeng-Shyang Pan,et al. Low-Complexity Digit-Serial and Scalable SPB/GPB Multipliers Over Large Binary Extension Fields Using (b,2)-Way Karatsuba Decomposition , 2014, IEEE Transactions on Circuits and Systems I: Regular Papers.

[6] K. Patel,et al. Implementing Digital Signature with RSA Encryption Algorithm to Enhance the Data Security of Cloud in Cloud Computing , 2016 .

[7] Christoforos E. Kozyrakis,et al. A case for intelligent RAM , 1997, IEEE Micro.

[8] Kaushik Roy,et al. TapeCache: a high density, energy efficient cache based on domain wall memory , 2012, ISLPED '12.

[9] Jacques-Olivier Klein,et al. Ultra Low Power Magnetic Flip-Flop Based on Checkpointing/Power Gating and Self-Enable Mechanisms , 2014, IEEE Transactions on Circuits and Systems I: Regular Papers.

[10] Z. Wei,et al. Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism , 2008, 2008 IEEE International Electron Devices Meeting.

[11] J. McCanny,et al. Modified Montgomery modular multiplication and RSA exponentiation techniques , 2004 .

[12] Michael J. Flynn,et al. Very high-speed computing systems , 1966 .

[13] Q. Stainer,et al. MRAM with soft reference layer: In-stack combination of memory and logic functions , 2013, 2013 5th IEEE International Memory Workshop.

[14] Cheng-Wen Wu,et al. An improved Montgomery's algorithm for high-speed RSA public-key cryptosystem , 1999, IEEE Trans. Very Large Scale Integr. Syst..

[15] Weisheng Zhao,et al. Perpendicular-magnetic-anisotropy CoFeB racetrack memory , 2012 .

[16] Ehsan Atoofian,et al. Reducing shift penalty in Domain Wall Memory through register locality , 2015, 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).

[17] Laurent Imbert,et al. Parallel Modular Multiplication on Multi-core Processors , 2013, 2013 IEEE 21st Symposium on Computer Arithmetic.

[18] Frederick T. Chen,et al. Low power and high speed bipolar switching with a thin reactive Ti buffer layer in robust HfO2 based RRAM , 2008, 2008 IEEE International Electron Devices Meeting.

[19] Onur Mutlu,et al. Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[20] Mohamad Towfik Krounbi,et al. Basic principles of STT-MRAM cell operation in memory arrays , 2013 .

[21] Luan Tran,et al. 45nm low power CMOS logic compatible embedded STT MRAM utilizing a reverse-connection 1T/1MTJ cell , 2009, 2009 IEEE International Electron Devices Meeting (IEDM).

[22] Ming-Der Shieh,et al. Scalable Montgomery Modular Multiplication Architecture with Low-Latency and Low-Memory Bandwidth Requirement , 2014, IEEE Transactions on Computers.

[23] P. L. Montgomery. Modular multiplication without trial division , 1985 .

[24] C. Rettner,et al. Current-Controlled Magnetic Domain-Wall Nanowire Shift Register , 2008, Science.

[25] Joonyoung Kim,et al. HBM: Memory solution for bandwidth-hungry processors , 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).

[26] Mircea R. Stan,et al. Relaxing non-volatility for fast and energy-efficient STT-RAM caches , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[27] F. Pellizzer,et al. Optimization metrics for Phase Change Memory (PCM) cell architectures , 2014, 2014 IEEE International Electron Devices Meeting.

[28] Sanu Mathew,et al. An improved unified scalable radix-2 Montgomery multiplier , 2005, 17th IEEE Symposium on Computer Arithmetic (ARITH'05).

[29] Chih-Wei Liu,et al. Design and Implementation of High-Speed and Energy-Efficient Variable-Latency Speculating Booth Multiplier (VLSBM) , 2013, IEEE Transactions on Circuits and Systems I: Regular Papers.

[30] S. O. Park,et al. Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses , 2004, IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004..

[31] Jiwu Shu,et al. Exploring data placement in racetrack memory based scratchpad memory , 2015, 2015 IEEE Non-Volatile Memory System and Applications Symposium (NVMSA).

[32] Sunggu Lee,et al. Accelerating graph computation with racetrack memory and pointer-assisted graph representation , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[33] Colin D. Walter,et al. Hardware Implementation of Montgomery's Modular Multiplication Algorithm , 1993, IEEE Trans. Computers.

[34] Kaushik Roy,et al. DWM-TAPESTRI - An energy efficient all-spin cache using domain wall shift based writes , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[35] Massimiliano Di Ventra,et al. On the physical properties of memristive, memcapacitive and meminductive systems , 2013, Nanotechnology.

[36] Zhao Zhang,et al. Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[37] D. H. Jacobsohn,et al. A Suggestion for a Fast Multiplier , 1964, IEEE Trans. Electron. Comput..

[38] Eric Pop,et al. Phase change materials and phase change memory , 2014 .

[39] Jacques-Olivier Klein,et al. Magnetic Adder Based on Racetrack Memory , 2013, IEEE Transactions on Circuits and Systems I: Regular Papers.

[40] F. Pellizzer,et al. Novel μtrench phase-change memory cell for embedded and stand-alone non-volatile memory applications , 2004, Digest of Technical Papers. 2004 Symposium on VLSI Technology, 2004..

[41] Vijayalakshmi Srinivasan,et al. Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[42] Eric Belhaire,et al. New non‐volatile logic based on spin‐MTJ , 2008 .

[43] Holger Orup,et al. Simplifying quotient determination in high-radix modular multiplication , 1995, Proceedings of the 12th Symposium on Computer Arithmetic.

[44] Jean-Pierre Seifert,et al. A new CRT-RSA algorithm secure against bellcore attacks , 2003, CCS '03.

[45] Cong Xu,et al. NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[46] Peter Kornerup. High-radix modular multiplication for cryptosystems , 1993, Proceedings of IEEE 11th Symposium on Computer Arithmetic.

[47] Mahmut T. Kandemir,et al. Leakage Current: Moore's Law Meets Static Power , 2003, Computer.

[48] Stuart S. P. Parkin,et al. Memory on the Racetrack , 2015 .

[49] Weisheng Zhao,et al. Low Power Magnetic Full-Adder Based on Spin Transfer Torque MRAM , 2013, IEEE Transactions on Magnetics.

[50] J. Thomas Pawlowski,et al. Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[51] Juan Carlos López,et al. Design and implementation of a coprocessor for cryptography applications , 1997, Proceedings European Design and Test Conference. ED & TC 97.

[52] Per-Åke Larson,et al. Modern Main-Memory Database Systems , 2016, Proc. VLDB Endow..

[53] Ehsan Atoofian,et al. Shift-aware racetrack memory , 2015, 2015 33rd IEEE International Conference on Computer Design (ICCD).

[54] Swaroop Ghosh,et al. Exploiting Serial Access and Asymmetric Read/Write of Domain Wall Memory for Area and Energy-Efficient Digital Signal Processor Design , 2016, IEEE Transactions on Circuits and Systems I: Regular Papers.

[55] Jean-Luc Gaudiot,et al. A Simple High-Speed Multiplier Design , 2006, IEEE Transactions on Computers.

[56] Ming-Der Shieh,et al. Word-Based Montgomery Modular Multiplication Algorithm for Low-Latency Scalable Architectures , 2010, IEEE Transactions on Computers.

[57] Keshab K. Parhi,et al. Design of low-error fixed-width modified booth multiplier , 2004, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[58] Ming-Der Shieh,et al. A New Modular Exponentiation Architecture for Efficient Design of RSA Cryptosystem , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[59] Çetin Kaya Koç,et al. High-Radix Design of a Scalable Modular Multiplier , 2001, CHES.

[60] Yue Zhang,et al. Ultra-High Density Content Addressable Memory Based on Current Induced Domain Wall Motion in Magnetic Track , 2012, IEEE Transactions on Magnetics.

[61] Akashi Satoh,et al. A Scalable Dual-Field Elliptic Curve Cryptographic Processor , 2003, IEEE Trans. Computers.

[62] Tarek A. El-Ghazawi,et al. New Hardware Architectures for Montgomery Modular Multiplication Algorithm , 2011, IEEE Transactions on Computers.

[63] Ming-Der Shieh,et al. A High-Performance Unified-Field Reconfigurable , 2010 .

[64] Keke Gai,et al. Phase-Change Memory Optimization for Green Cloud with Genetic Algorithm , 2015, IEEE Transactions on Computers.

[65] C. D. Walter,et al. Systolic Modular Multiplication , 1993, IEEE Trans. Computers.

[66] Yun Liang,et al. Performance-Centric Optimization for Racetrack Memory Based Register File on GPUs , 2016, Journal of Computer Science and Technology.

[67] Alfred Menezes,et al. Guide to Elliptic Curve Cryptography , 2004, Springer Professional Computing.

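[68] 裕幸飯田,et al. On the cleanliness requirements of the International Technology Roadmap for Semiconductors 2003: required cleanliness of silicon wafer surfaces and the ambient environment, and the current state of analysis methods , 2004 .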

[69] Jun Yang,et al. Exploit common source-line to construct energy efficient domain wall memory based caches , 2015, 2015 33rd IEEE International Conference on Computer Design (ICCD).

[70] Jian-Ping Wang,et al. A spintronics full adder for magnetic CPU , 2005 .

[71] Duncan G. Elliott,et al. Computational RAM: Implementing Processors in Memory , 1999, IEEE Des. Test Comput..

[72] Scott Hauck,et al. Reconfigurable computing: a survey of systems and software , 2002, CSUR.

[73] S. Parkin,et al. Magnetic Domain-Wall Racetrack Memory , 2008, Science.

[74] Rami G. Melhem,et al. Multilane Racetrack caches: Improving efficiency through compression and independent shifting , 2015, The 20th Asia and South Pacific Design Automation Conference.

[75] H. Ohno,et al. Fabrication of a Nonvolatile Full Adder Based on Logic-in-Memory Architecture Using Magnetic Tunnel Junctions , 2008 .

[76] Sparsh Mittal,et al. A Survey of Techniques for Architecting and Managing GPU Register File , 2017, IEEE Transactions on Parallel and Distributed Systems.

[77] Jaewook Shin,et al. Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[78] Adi Shamir,et al. A method for obtaining digital signatures and public-key cryptosystems , 1978, CACM.

[79] Kiyoung Choi,et al. A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[80] Duane Mills,et al. 19.7 A 16Gb ReRAM with 200MB/s write and 1GB/s read in 27nm technology , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[81] Yu Wang,et al. Hi-fi playback: Tolerating position errors in shift operations of racetrack memory , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[82] T. Kuroda. CMOS design challenges to power wall , 2001, Digest of Papers. Microprocesses and Nanotechnology 2001. 2001 International Microprocesses and Nanotechnology Conference (IEEE Cat. No.01EX468).

[83] Issa M. Khalil,et al. Cloud Computing Security: A Survey , 2014, Comput..

[84] Haomin Wu,et al. A new design of the CMOS full adder , 1992 .

[85] G. Finocchio,et al. A strategy for the design of skyrmion racetrack memories , 2014, Scientific Reports.

[86] Yajun Ha,et al. A Low Active Leakage and High Reliability Phase Change Memory (PCM) Based Non-Volatile FPGA Storage Element , 2014, IEEE Transactions on Circuits and Systems I: Regular Papers.

[87] Bernard Dieny,et al. Synchronous 8-bit Non-Volatile Full-Adder based on Spin Transfer Torque Magnetic Tunnel Junction , 2015, IEEE Transactions on Circuits and Systems I: Regular Papers.

[88] Tarek Darwish,et al. Performance analysis of low-power 1-bit CMOS full adder cells , 2002, IEEE Trans. Very Large Scale Integr. Syst..

[89] Kaushik Roy,et al. Energy-Efficient All-Spin Cache Hierarchy Using Shift-Based Writes and Multilevel Storage , 2015, ACM J. Emerg. Technol. Comput. Syst..

[90] Alexander Albicki,et al. Low power and high speed multiplication design through mixed number representations , 1995, Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors.

[91] Tolga Acar,et al. Analyzing and comparing Montgomery multiplication algorithms , 1996, IEEE Micro.

[92] Burton S. Kaliski,et al. The Montgomery Inverse and Its Applications , 1995, IEEE Trans. Computers.

[93] Fabien Clermidy,et al. Bipolar ReRAM Based non-volatile flip-flops for low-power architectures , 2012, 10th IEEE International NEWCAS Conference.

[94] Dejan Markovic,et al. True Energy-Performance Analysis of the MTJ-Based Logic-in-Memory Architecture (1-Bit Full Adder) , 2010, IEEE Transactions on Electron Devices.

[95] Craig Gentry,et al. Fully homomorphic encryption using ideal lattices , 2009, STOC '09.

[96] Hai Li,et al. Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power , 2015, The 20th Asia and South Pacific Design Automation Conference.

[97] Yiran Chen,et al. Exploration of GPGPU register file architecture using domain-wall-shift-write based racetrack memory , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).

[98] Kaushik Roy,et al. STAG: Spintronic-Tape Architecture for GPGPU cache hierarchies , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[99] Sanghyeon Lee,et al. Enhanced cycling endurance in phase change memory via electrical control of switching induced atomic migration , 2014, 2014 14th Annual Non-Volatile Memory Technology Symposium (NVMTS).

[100] Orest J. Bedrij. Carry-Select Adder , 1962, IRE Trans. Electron. Comput..

[101] Kaushik Roy,et al. Cache Design with Domain Wall Memory , 2016, IEEE Transactions on Computers.

[102] M.-J. Hsiao,et al. Carry-select adder using single ripple-carry adder , 1998 .

[103] H-S Philip Wong,et al. Memory leads the way to better computing , 2015, Nature Nanotechnology.

[104] Victor S. Miller,et al. Use of Elliptic Curves in Cryptography , 1985, CRYPTO.

[105] Hao Yu,et al. Energy efficient in-memory AES encryption based on nonvolatile domain-wall nanowire , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[106] Yiran Chen,et al. Compiler-assisted refresh minimization for volatile STT-RAM cache , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).

[107] Wenqing Wu,et al. Cross-layer racetrack memory design for ultra high density and low power consumption , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[108] Amar Mandal,et al. Tripartite Modular Multiplication using Toom-Cook Multiplication , 2012 .

[109] Rami G. Melhem,et al. ContextPreRF: Enhancing the Performance and Energy of GPUs With Nonuniform Register Access , 2016, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[110] G. Huang,et al. An Energy-Efficient Nonvolatile In-Memory Computing Architecture for Extreme Learning Machine by Domain-Wall Nanowire Devices , 2015, IEEE Transactions on Nanotechnology.

[111] Mahmut T. Kandemir,et al. Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[112] Hari Balakrishnan,et al. CryptDB: protecting confidentiality with encrypted query processing , 2011, SOSP.

[113] Yi Gang,et al. A High-Reliability, Low-Power Magnetic Full Adder , 2011, IEEE Transactions on Magnetics.

[114] Kailash Gopalakrishnan,et al. Overview of candidate device technologies for storage-class memory , 2008, IBM J. Res. Dev..

[115] Yiran Chen,et al. Design of Last-Level On-Chip Cache Using Spin-Torque Transfer RAM (STT RAM) , 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[116] Çetin Kaya Koç,et al. A Scalable Architecture for Montgomery Multiplication , 1999, CHES.

[117] G. Servalli,et al. A 45nm generation Phase Change Memory technology , 2009, 2009 IEEE International Electron Devices Meeting (IEDM).

[118] Jun Yang,et al. A durable and energy efficient main memory using phase change memory technology , 2009, ISCA '09.

[119] Sascha Vongehr,et al. The Missing Memristor has Not been Found , 2015, Scientific Reports.

[120] Çetin Kaya Koç,et al. A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm , 2003, IEEE Trans. Computers.

[121] Tian-Sheuan Chang,et al. A new RSA cryptosystem hardware design based on Montgomery's algorithm , 1998 .

[122] Anand Raghunathan,et al. Domain-Specific Many-core Computing using Spin-based Memory , 2014, IEEE Transactions on Nanotechnology.

[123] R. Schaller,et al. Moore's law: past, present and future , 1997 .

[124] Wenqing Wu,et al. Multi retention level STT-RAM cache designs with a dynamic refresh scheme , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[125] Haralampos Pozidis,et al. Recent Progress in Phase-Change Memory Technology , 2016, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[126] Makoto Motoyoshi,et al. Through-Silicon Via (TSV) , 2009, Proceedings of the IEEE.

[127] H.-S. Philip Wong,et al. Phase Change Memory , 2010, Proceedings of the IEEE.

[128] N. Koblitz. Elliptic curve cryptosystems , 1987 .

[129] Nisha Checka,et al. Technology, performance, and computer-aided design of three-dimensional integrated circuits , 2004, ISPD '04.

[130] Colin D. Walter. Space/Time Trade-Offs for Higher Radix Modular Multiplication Using Repeated Addition , 1997, IEEE Trans. Computers.