Accessible, Reproducible, and Scalable Machine Learning for Biomedicine

Supervised machine learning, in which a model is trained on labeled data to predict the labels of new instances, has become an essential tool in biomedical data analysis. To make it more accessible to biomedical scientists, we have developed Galaxy-ML, a platform that enables scientists to perform end-to-end, reproducible machine learning analyses at large scale using only a web browser. Galaxy-ML extends Galaxy, a biomedical computational workbench used by tens of thousands of scientists across the world, with a suite of machine learning tools.
