Asset Management in Machine Learning: A Survey

Machine learning (ML) techniques are becoming essential components of many software systems, creating an increasing need to adapt traditional software engineering practices and tools to the development of ML-based software systems. This need is especially pronounced because of the challenges of developing and deploying ML systems at scale. Among the most commonly reported challenges during the development, production, and operation of ML-based systems are experiment management, dependency management, monitoring, and logging of ML assets. In recent years, several efforts have addressed these challenges, as witnessed by an increasing number of tools for tracking and managing ML experiments and their assets. To facilitate research and practice on engineering intelligent systems, it is essential to understand the nature of the current tool support for managing ML assets. What kind of support is provided? What asset types are tracked? What operations are offered to users for managing those assets? We discuss and position ML asset management as an important discipline that provides methods and tools for managing ML assets as structures and ML development activities as the operations on them. We present a feature-based survey of 17 tools with ML asset management support, identified in a systematic search. We give an overview of these tools' features for managing the different types of assets used for engineering ML-based systems and performing experiments. We found that most of the asset management support depends on traditional version control systems, while only a few tools support an asset granularity level that differentiates between important ML assets, such as datasets and models.
