Near-Data Processing for Machine Learning

In computer architecture, near-data processing (NDP) refers to augmenting memory or storage with processing power so that the data stored therein can be processed in place. By offloading computation from the CPU and eliminating the need to transfer raw data in its entirety, NDP holds great potential for acceleration and power reduction. Despite this potential, NDP research saw only limited success until recently, largely because the mismatch between logic and memory process technologies limits the processing capability that can be placed in memory. Two recent developments have rekindled interest in NDP. The first is the success of machine learning (ML), whose training often demands heavy computation and frequent transfers of large datasets. The second is the advent of NAND flash-based solid-state drives (SSDs) containing multicore processors that can accommodate extra computation for data processing. Motivated by these application needs and this technological support, we evaluate the potential of NDP for ML using a new SSD platform that allows us to simulate in-storage processing (ISP) of ML workloads. Our platform (named ISP-ML) is a full-fledged simulator of a realistic multi-channel SSD that can execute various ML algorithms on the data stored in the SSD. For thorough performance analysis and in-depth comparison with alternatives, we focus on a specific algorithm: stochastic gradient descent (SGD), the de facto standard for training differentiable learning machines, including deep neural networks. We implement and compare three variants of SGD (synchronous, Downpour, and elastic averaging) on ISP-ML, exploiting the multiple NAND channels to parallelize SGD. In addition, we compare the performance of ISP with that of conventional in-host processing, revealing the advantages of ISP. Based on the advantages and limitations identified through our experiments, we further discuss directions for future research on ISP for accelerating ML.
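To make the parallelization schemes named above concrete, the sketch below illustrates the update rules of two of the compared variants, synchronous SGD and elastic averaging SGD, in plain NumPy with one worker per data shard (one shard per NAND channel in the paper's setting). This is a minimal illustration rather than the ISP-ML implementation: the linear-model gradient, the shard layout, and the hyperparameters eta and rho are assumptions chosen for brevity.

    import numpy as np

    def grad(w, X, y):
        """Gradient of a mean-squared-error linear model (illustrative loss only)."""
        return 2.0 * X.T @ (X @ w - y) / len(y)

    def synchronous_sgd(shards, w, eta=0.01, steps=100):
        """Synchronous SGD: every worker computes a gradient on its shard,
        the gradients are averaged, and one update is applied in lockstep."""
        w = w.copy()
        for _ in range(steps):
            g = np.mean([grad(w, X, y) for X, y in shards], axis=0)
            w -= eta * g
        return w

    def elastic_averaging_sgd(shards, w_center, eta=0.01, rho=0.1, steps=100):
        """Elastic averaging SGD: each worker keeps a local copy that is pulled
        toward a shared center variable, which in turn drifts toward the workers."""
        w_center = w_center.copy()
        local = [w_center.copy() for _ in shards]
        for _ in range(steps):
            for i, (X, y) in enumerate(shards):
                diff = local[i] - w_center           # elastic force between worker and center
                local[i] -= eta * (grad(local[i], X, y) + rho * diff)
                w_center += eta * rho * diff
        return w_center

    # Toy usage: four shards standing in for four NAND channels.
    rng = np.random.default_rng(0)
    shards = [(rng.standard_normal((64, 8)), rng.standard_normal(64)) for _ in range(4)]
    w_sync = synchronous_sgd(shards, np.zeros(8))
    w_easgd = elastic_averaging_sgd(shards, np.zeros(8))

In the elastic averaging variant, rho controls how strongly each worker is pulled toward the shared center variable; the Downpour variant (not sketched) instead has workers asynchronously push gradients to, and pull parameters from, a shared parameter copy.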
