Paper Information - PRINS: Processing-in-Storage Acceleration of Machine Learning

PRINS: Processing-in-Storage Acceleration of Machine Learning

Machine learning algorithms have become a major tool in various applications. Their high performance requirements on large-scale datasets pose a challenge for traditional von Neumann architectures. We present two machine learning implementations and evaluations on PRINS, a novel processing-in-storage system based on resistive content addressable memory (ReCAM). PRINS functions simultaneously as storage and as a massively parallel associative processor. PRINS processing-in-storage resolves the bandwidth wall faced by near-data von Neumann architectures, such as a three-dimensional DRAM and CPU stack or an SSD with an embedded CPU, by keeping the computing inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that a PRINS-based processing-in-storage architecture may outperform existing in-storage designs and accelerator-based designs. Multiple performance comparisons are performed for the ReCAM processing-in-storage implementations of K-means and K-nearest neighbors. The compared platforms include CPU, GPU, FPGA, and the Automata Processor. We show that PRINS may achieve an order-of-magnitude speedup and improved power efficiency relative to all compared platforms.
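To make the idea of an associative processor more concrete, the following is a minimal software sketch of the generic compare-then-write step that content addressable memories support. It is not taken from the paper and does not model PRINS or ReCAM hardware: the function names `cam_compare` and `cam_write`, the integer row encoding, and the 4-bit example are all assumptions made for illustration. In a real CAM the comparison happens in all rows at once; here a Python loop stands in for that parallelism.

```python
from typing import List


def cam_compare(rows: List[int], key: int, mask: int) -> List[bool]:
    """Tag every row whose masked bits equal the masked key.

    Hypothetical helper: a CAM would evaluate all rows in parallel.
    """
    return [(row & mask) == (key & mask) for row in rows]


def cam_write(rows: List[int], tags: List[bool], value: int, mask: int) -> List[int]:
    """Write the masked bits of `value` into every tagged row.

    Untagged rows, and unmasked bits of tagged rows, are left unchanged.
    """
    return [
        (row & ~mask) | (value & mask) if tag else row
        for row, tag in zip(rows, tags)
    ]


# Illustrative data: four 4-bit rows.
rows = [0b0001, 0b0101, 0b0010, 0b1001]

# Step 1: tag rows whose two low bits are 0b01.
tags = cam_compare(rows, key=0b01, mask=0b0011)

# Step 2: set bit 3 in every tagged row.
rows = cam_write(rows, tags, value=0b1000, mask=0b1000)

print(tags)                                # [True, True, False, True]
print([format(r, "04b") for r in rows])    # ['1001', '1101', '0010', '1001']
```

The point of the sketch is only that the data never leaves the array: a key and mask are broadcast to the rows, and matching rows are updated in place, which is the sense in which the abstract contrasts in-data with near-data processing.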
