Tool recommender system in Galaxy using deep learning

Galaxy is a web-based, open-source platform for processing scientific data. Researchers compose pipelines in Galaxy to analyse scientific data. These pipelines, also known as workflows, can be complex and difficult to create from thousands of tools, especially for researchers new to Galaxy. To make creating workflows easier, faster and less error-prone, a predictive system is developed to recommend tools that facilitate further analysis. A model is created to recommend tools by analysing workflows composed by researchers on the European Galaxy server, using a deep learning approach. The higher-order dependencies in workflows, which are represented as directed acyclic graphs, are learned by training a gated recurrent unit (GRU) neural network, a variant of a recurrent neural network (RNN). The weights of tools used in the neural network training are derived from their usage frequencies over a period of time. The hyper-parameters of the neural network are optimised using Bayesian optimisation. The model achieves an accuracy of 97% in predicting tools on the precision@1, precision@2 and precision@3 metrics. The model is accessed through a Galaxy API to recommend tools in real time. Multiple user interface (UI) integrations on the server communicate with this API to show researchers the recommended tools interactively.

Contact: kumara@informatik.uni-freiburg.de, gruening@informatik.uni-freiburg.de, backofen@informatik.uni-freiburg.de
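
The precision@k metrics named in the abstract can be illustrated with a minimal sketch. It assumes one common definition: the fraction of the k highest-scored tools that are among the true next tools. The abstract does not spell out the exact formula used, so the function name `precision_at_k`, the score vector and the set of true tools below are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def precision_at_k(scores: Sequence[float], true_tools: Set[int], k: int) -> float:
    """Fraction of the k highest-scored tools that are actual next tools.

    Assumed definition for illustration; not taken from the paper.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank tool indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    top_k = ranked[:k]
    hits = sum(1 for tool in top_k if tool in true_tools)
    return hits / len(top_k) if top_k else 0.0


# Hypothetical model scores for five tools; tools 1 and 3 are the true next tools.
example_scores = [0.05, 0.60, 0.10, 0.20, 0.05]
example_truth = {1, 3}

for k in (1, 2, 3):
    print(f"precision@{k} = {precision_at_k(example_scores, example_truth, k):.3f}")
```

In this toy case the two highest-scored tools are both correct, so precision@1 and precision@2 are 1.0, while the third-ranked tool is not a true next tool and precision@3 drops to 2/3.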
