A Recommender System for Scientific Datasets and Analysis Pipelines

Scientific datasets and analysis pipelines are increasingly shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the growing number of high-quality public datasets and pipelines, this lack of clear compatibility information threatens the findability and reusability of these resources. We investigate the feasibility of a collaborative filtering system that recommends pipelines and datasets based on provenance records from previous executions. We evaluate our system using datasets and pipelines extracted from the Canadian Open Neuroscience Platform, a national initiative for open neuroscience. The recommendations provided by our system (AUC = 0.83) are significantly better than chance and outperform recommendations made by domain experts relying on their prior knowledge and on pipeline and dataset descriptions (AUC = 0.63). In particular, domain experts often neglect low-level technical aspects of a pipeline-dataset interaction, such as the level of pre-processing, which a provenance-based system captures. We conclude that provenance-based pipeline and dataset recommenders are feasible and beneficial to the sharing and usage of open-science resources. Future work will focus on collecting more comprehensive provenance traces and on deploying the system in production.
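To make the idea of provenance-based collaborative filtering concrete, the following is a minimal sketch, not the authors' implementation. It assumes that provenance records can be summarized as a binary pipeline-by-dataset matrix (1 for a successful execution, 0 for a failed one), factorizes the observed entries with plain gradient descent, and scores held-out pairs by AUC. The toy matrix, the held-out pairs, the hyperparameters, and the names `factorize` and `auc` are all hypothetical.

```python
import numpy as np

def factorize(R, mask, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit R ~ P @ Q.T on the observed entries (mask == True) by gradient descent."""
    rng = np.random.default_rng(seed)
    n_pipelines, n_datasets = R.shape
    P = 0.1 * rng.standard_normal((n_pipelines, rank))
    Q = 0.1 * rng.standard_normal((n_datasets, rank))
    for _ in range(epochs):
        err = mask * (R - P @ Q.T)       # residual on observed entries only
        P_step = err @ Q - reg * P       # descent direction with L2 penalty
        Q_step = err.T @ P - reg * Q
        P += lr * P_step
        Q += lr * Q_step
    return P, Q

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Hypothetical outcomes: rows are pipelines, columns are datasets.
# Pipelines 0-1 succeed on datasets 0-2; pipelines 2-3 succeed on datasets 3-4.
R = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Hide a few (pipeline, dataset) pairs to act as unseen executions.
held_out = [(0, 2), (1, 3), (2, 4), (3, 0)]
mask = np.ones_like(R, dtype=bool)
for i, j in held_out:
    mask[i, j] = False

P, Q = factorize(R, mask)
pred = P @ Q.T

rows, cols = zip(*held_out)
scores = pred[rows, cols]
labels = R[rows, cols].astype(int)
held_out_auc = auc(scores, labels)
print(f"held-out AUC: {held_out_auc:.2f}")
```

An AUC of 0.5 corresponds to chance-level ranking of compatible over incompatible pairs, which is the baseline against which the reported values of 0.83 and 0.63 should be read.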
