Data Platform for Machine Learning

In this paper, we present MLdp, a purpose-built data management system for all machine learning (ML) datasets. ML applications pose requirements that differ from those of conventional data processing applications, including data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error-driven data exploration and evolution, rapid experimentation, reproducibility of model training, and strict compliance and privacy regulations. Current ML systems and services, often called MLaaS, have to date focused on ML algorithms and offer no integrated data management system. Instead, they require users to bring their own data and to manage it themselves on blob storage or file systems. The burden of data management tasks, such as versioning and access control, falls on the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features such as security, availability, and scalability, MLdp's internal design choices are strongly influenced by the goal of supporting rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery.
The contributions of this paper are: 1) to recognize the needs and call out the requirements of an ML data platform, 2) to share our experience in building MLdp, both by adopting existing database technologies for the new problem and by devising new solutions, and 3) to call for action from our communities on future challenges.
