Microbiome Learning Repo (ML Repo): A public repository of microbiome regression and classification tasks

Abstract: The use of machine learning in high-dimensional biological applications, such as human microbiome research, has grown exponentially in recent years, but algorithm developers often lack the domain expertise needed to interpret and curate heterogeneous microbiome datasets. We present Microbiome Learning Repo (ML Repo, available at https://knights-lab.github.io/MLRepo/), a public, web-based repository of 33 curated classification and regression tasks drawn from 15 published human microbiome datasets. We demonstrate ML Repo in several use cases to show its broad applicability, and we expect it to be an important resource for algorithm developers.
