Versioning for End-to-End Machine Learning Pipelines

End-to-end machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages. Intermediate results are persisted between stages to reduce redundant computation and to improve robustness. These results may be datasets, in the case of data processing pipelines, or model coefficients, in the case of model training pipelines. Reusing persisted results improves efficiency but also creates complicated dependencies. Whenever a processing stage changes, whether through a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which must be recomputed. In this paper we build upon previous work to produce derivations of datasets, which ensure that multiple versions of a pipeline can run in parallel while minimizing redundant computation. Our extensions include partial derivations to simplify navigation and reuse, explicit support for schema changes of pipelines, and a central registry of running pipelines to coordinate pipeline upgrades between teams.
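The reuse problem described above can be illustrated with a minimal sketch. It assumes, hypothetically, that a derivation is a hash over a stage's name, code version, parameters, and the derivations of its inputs; the `Stage` class, the `plan` function, and the example stage names and parameters are illustrative inventions, not the scheme or API defined in the paper.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Stage:
    """Hypothetical pipeline stage; all field names are assumptions."""
    name: str
    code_version: str
    params: Tuple[Tuple[str, str], ...] = ()
    inputs: Tuple["Stage", ...] = ()

    def derivation(self) -> str:
        # The key covers everything that can change the stage's output:
        # its code version, its parameters, and its upstream derivations.
        payload = {
            "name": self.name,
            "code_version": self.code_version,
            "params": sorted(self.params),
            "inputs": [s.derivation() for s in self.inputs],
        }
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:16]


def plan(stages: Iterable[Stage], store: Set[str]) -> Dict[str, str]:
    """Mark each stage as reusable if its derivation is already persisted."""
    return {
        s.name: "reuse" if s.derivation() in store else "recompute"
        for s in stages
    }


# Version 1 of the pipeline has already run, so its results are persisted.
clean_v1 = Stage("clean", "1.0.0", (("min_events", "5"),))
feat_v1 = Stage("features", "1.0.0", (("hash_bits", "18"),), (clean_v1,))
train_v1 = Stage("train", "1.0.0", (("l2", "0.1"),), (feat_v1,))
store = {s.derivation() for s in (clean_v1, feat_v1, train_v1)}

# Version 2 changes one parameter of the feature stage only.
feat_v2 = Stage("features", "1.0.0", (("hash_bits", "20"),), (clean_v1,))
train_v2 = Stage("train", "1.0.0", (("l2", "0.1"),), (feat_v2,))

result = plan([clean_v1, feat_v2, train_v2], store)
print(result)
```

In this sketch the unchanged `clean` stage keeps its derivation and is reused, while the parameter change in `features` yields a new derivation that propagates to `train`. Because the two versions write under different keys, both can run side by side without overwriting each other's results.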
