Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches

Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems. For bringing MOP-based DL to DB-resident data, however, we show that there is no single “best” approach and that an interesting tradeoff space of approaches exists. We explain four canonical approaches, build prototypes of them on Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance), and compare them empirically on large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and that there is a Pareto frontier; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.

PVLDB Reference Format:
Yuhao Zhang, Frank McQuillan, Nandish Jayaram, Nikhil Kak, Ekta Khanna, Orhan Kislal, Domino Valdano, and Arun Kumar. Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches. PVLDB, 14(10): 1769-1782, 2021. doi:10.14778/3467861.3467867

∗Work done mostly while at Pivotal (now VMware).

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 10, ISSN 2150-8097.
