An interoperable service for the provenance of machine learning experiments

Machine Learning (ML) experiments are now easy to build with any of several ML frameworks, and the demand for practical solutions to many kinds of scientific problems keeps growing. Yet organizing the results and the algorithm setups used, so that the experiments can be reproduced, remains a long-known problem without an easy solution. Motivated by the need for a high level of interoperability and data provenance in ML experiments, this work presents a generic solution: a web-service application that interacts with the MEX vocabulary, a lightweight solution for archiving and querying ML experiments. With this solution, researchers can share their setups and results in an interoperable format that describes all the steps needed to reproduce their research. Although the solution presented in this work could be implemented in any programming language, we chose Java to build the web service, and we present experiments with Python's Scikit-learn ML framework that use decorators and code reflection. These experiments demonstrate how simply data provenance can be incorporated at such a high level, which simplifies the experiment logging process.
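To illustrate the idea of logging an experiment through decorators and code reflection, the following is a minimal sketch and not the implementation described in this work. The names `log_experiment`, `EXPERIMENT_LOG`, and `run_experiment` are hypothetical, the toy majority-class "experiment" stands in for a real Scikit-learn run, and the in-memory list stands in for the web service that would receive the metadata.

```python
import functools
import inspect
import json
import time

# Hypothetical in-memory log; a real setup would send each record to a service.
EXPERIMENT_LOG = []


def log_experiment(func):
    """Hypothetical decorator: record parameters, result, and run time of a call."""
    # Reflection: read the function's signature once so defaults can be logged too.
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        started = time.time()
        result = func(*args, **kwargs)
        EXPERIMENT_LOG.append({
            "function": func.__name__,
            "module": func.__module__,
            "parameters": {name: repr(value) for name, value in bound.arguments.items()},
            "result": repr(result),
            "duration_seconds": time.time() - started,
        })
        return result

    return wrapper


@log_experiment
def run_experiment(algorithm="majority_class", train_labels=(1, 1, 0), test_labels=(1, 0)):
    """Toy stand-in for a training run: always predict the most frequent training label."""
    majority = max(set(train_labels), key=train_labels.count)
    correct = sum(1 for label in test_labels if label == majority)
    return {"algorithm": algorithm, "accuracy": correct / len(test_labels)}


if __name__ == "__main__":
    run_experiment(test_labels=(1, 1, 0, 1))
    print(json.dumps(EXPERIMENT_LOG, indent=2))
```

The experiment function itself contains no logging code; the decorator captures the setup (including default parameter values) and the outcome, which is the kind of separation the abstract describes.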
