Consistent metagenomic biomarker detection via robust PCA

Background: Recent developments in high-throughput sequencing technologies allow the characterization of the microbial communities inhabiting our world. Various metagenomic studies have suggested using microbial taxa as potential biomarkers for certain diseases. In practice, the number of available samples varies from experiment to experiment. Therefore, a robust biomarker detection algorithm is needed to provide a set of potential markers irrespective of the number of available samples. Consistent performance is essential to derive solid biological conclusions and to transfer these findings into clinical applications. Surprisingly, the consistency of a metagenomic biomarker detection algorithm with respect to variation in the experiment size has not been addressed by the current state-of-the-art algorithms.

Results: We propose a consistency-classification framework that enables the assessment of both the consistency and the classification performance of a biomarker discovery algorithm. This evaluation protocol is based on random resampling to mimic the variation in the experiment size. Moreover, we model the metagenomic data matrix as a superposition of two matrices. The first is a low-rank matrix that models the abundance levels of the irrelevant bacteria. The second is a sparse matrix that captures the abundance levels of the bacteria that are differentially abundant between phenotypes. We then propose a novel Robust Principal Component Analysis (RPCA) based biomarker discovery algorithm to recover the sparse matrix. RPCA belongs to the class of multivariate feature selection methods, which treat the features collectively rather than individually. This gives the proposed algorithm an inherent ability to handle complex microbial interactions. Comprehensive comparisons of RPCA with state-of-the-art algorithms are conducted on two realistic datasets. The results show that RPCA consistently outperforms the other algorithms in terms of classification accuracy and reproducibility.

Conclusions: The RPCA-based biomarker detection algorithm provides high reproducibility irrespective of the complexity of the dataset or the number of selected biomarkers. RPCA also selects biomarkers with high discriminative accuracy. Thus, RPCA is a consistent and accurate tool for selecting taxonomic biomarkers for different microbial populations.

Reviewers: This article was reviewed by Masanori Arita and Zoltan Gaspari.
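To make the low-rank-plus-sparse model and the resampling protocol more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a taxa-by-samples abundance matrix D, solves the standard RPCA convex program (nuclear norm of the low-rank part plus a weighted l1 norm of the sparse part) with an inexact augmented Lagrange multiplier iteration, and ranks taxa by the summed absolute value of their row in the sparse component. The function names, the ranking rule, the default weight lam = 1/sqrt(max(m, n)), and the use of mean pairwise Jaccard overlap as a consistency score are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def shrink(X, tau):
    """Soft-thresholding: pull every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def rpca_ialm(D, lam=None, tol=1e-7, max_iter=500):
    """Split D into low-rank A plus sparse E (inexact ALM sketch).

    Assumes D is a non-zero 2-D array (taxa x samples).
    """
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # common default weight (assumption)
    norm_D = np.linalg.norm(D, "fro")
    spectral = np.linalg.norm(D, 2)         # largest singular value
    Y = D / max(spectral, np.abs(D).max() / lam)  # dual variable start
    mu, rho = 1.25 / spectral, 1.5          # penalty and its growth factor
    mu_max = mu * 1e7
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(max_iter):
        # Low-rank update: soft-threshold the singular values.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * shrink(s, 1.0 / mu)) @ Vt
        # Sparse update: soft-threshold the entries.
        E = shrink(D - A + Y / mu, lam / mu)
        R = D - A - E                       # constraint residual
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(R, "fro") <= tol * norm_D:
            break
    return A, E


def top_k_features(D, k):
    """Indices of the k taxa with the largest sparse-component mass.

    Ranking by row-wise sum of |E| is an illustrative choice.
    """
    _, E = rpca_ialm(D)
    scores = np.abs(E).sum(axis=1)
    return set(np.argsort(scores)[::-1][:k].tolist())


def resampling_consistency(D, k, n_rounds=10, frac=0.7, seed=0):
    """Mean pairwise Jaccard overlap of top-k sets over random subsamples.

    Each round keeps a random fraction of the samples (columns), which
    mimics running the same experiment with fewer samples.
    """
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    n = D.shape[1]
    size = max(2, int(round(frac * n)))
    picks = []
    for _ in range(n_rounds):
        cols = rng.choice(n, size=size, replace=False)
        picks.append(top_k_features(D[:, cols], k))
    overlaps = [len(a & b) / len(a | b)
                for i, a in enumerate(picks) for b in picks[i + 1:]]
    return float(np.mean(overlaps))
```

Calling top_k_features(D, k) on a taxa-by-samples matrix returns candidate biomarker row indices, and resampling_consistency(D, k) returns a score between 0 and 1, where 1 means every subsample selected the same set. A published stability index could be substituted for the Jaccard overlap without changing the rest of the sketch.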
