PRINS: Processing-in-Storage Acceleration of Machine Learning

Machine learning algorithms have become a major tool in various applications. Their high performance requirements on large-scale datasets pose a challenge for traditional von Neumann architectures. We present implementations and evaluations of two machine learning algorithms on PRINS, a novel processing-in-storage system based on resistive content addressable memory (ReCAM). PRINS functions simultaneously as storage and as a massively parallel associative processor. PRINS processing-in-storage resolves the bandwidth wall faced by near-data von Neumann architectures, such as a three-dimensional DRAM and CPU stack or an SSD with an embedded CPU, by keeping the computing inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that a PRINS-based processing-in-storage architecture may outperform existing in-storage designs and accelerator-based designs. We compare the performance of the ReCAM processing-in-storage implementations of $K$-means and $K$-nearest neighbors against several platforms: CPU, GPU, FPGA, and the Automata Processor. We show that PRINS may achieve an order-of-magnitude speedup and improved power efficiency relative to all compared platforms.
