mAML: an automated machine learning pipeline with a microbiome repository for human disease classification

Growing efforts to use microbial features to improve disease prediction have created strong demand for automated machine learning (AutoML) systems that remove the tedium of performing ML tasks manually. Here we developed mAML, an ML model-building pipeline that automatically and rapidly generates optimized and interpretable models for personalized microbial classification tasks in a reproducible way. The pipeline is deployed on a web-based platform; the server is user-friendly, flexible, and designed to scale according to specific requirements. The pipeline exhibits high performance on 13 benchmark datasets covering both binary and multi-class classification tasks. In addition, to facilitate the application of mAML and expand the human disease-related microbiome learning repository, we developed the GMrepo ML repository (GMrepo Microbiome Learning repository) from the GMrepo database. The repository comprises 120 microbial classification tasks for 85 human-disease phenotypes, covering 12,429 metagenomic samples and 38,643 amplicon samples. The mAML pipeline and the GMrepo ML repository are expected to be important resources for research in microbiology and algorithm development.

Database URL: http://39.100.246.211:8050/Home
