Cost-effective, Energy-efficient, and Scalable Storage Computing for Large-scale AI Applications

The growing volume of data produced continuously in the Cloud and at the Edge poses significant challenges for large-scale AI applications, which must extract and learn useful information from that data in a timely and efficient way. This article explores the use of computational storage to address these challenges through distributed near-data processing. We describe Newport, a high-performance, energy-efficient computational storage device developed to realize the full potential of in-storage processing. To the best of our knowledge, Newport is the first commodity SSD that can be configured to run a server-like operating system, greatly reducing the effort of creating and maintaining applications that run inside the storage. We analyze the benefits of Newport by running complex AI applications, such as image similarity search and object tracking, on a large visual dataset. The results demonstrate that data-intensive AI workloads can be efficiently parallelized and offloaded, even to a small set of Newport drives, with significant performance gains and energy savings. In addition, we introduce a comprehensive taxonomy of existing computational storage solutions, together with a realistic cost analysis for high-volume production, giving a clear overview of the economic feasibility of computational storage technology.
