A Bayes testing approach to metagenomic profiling in bacteria

Using next-generation sequencing (NGS) data, we use a multinomial model with a Dirichlet prior to detect the presence of bacteria in a metagenomic sample via marginal Bayes testing for each bacterial strain. The NGS reads are counted fractionally per strain: each read contributes an equal amount to every strain it might represent. The detection threshold is strain-dependent, and we correct for dependence among the NGS reads by finding the knee in a curve that represents the tradeoff between detecting too many strains and detecting too few. As a check, we evaluate the joint posterior probabilities for the presence of pairs of bacterial strains and find relatively little dependence. We apply our techniques to two data sets and compare our results with those of the Human Microbiome Project. We conclude with a discussion of the issues surrounding multiple corrections in a Bayes context.
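The counting and thresholding steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symmetric Dirichlet prior with concentration 1, and the chord-distance rule for locating a knee are all assumptions made for illustration, and the per-strain Bayes test itself is not reproduced here.

```python
def fractional_counts(read_hits, strains):
    """Split each read equally among the strains it might represent."""
    counts = {s: 0.0 for s in strains}
    for hits in read_hits:
        hits = set(hits)          # ignore duplicate strain labels for a read
        if not hits:
            continue              # a read matching no strain adds nothing
        share = 1.0 / len(hits)
        for s in hits:
            counts[s] += share
    return counts


def posterior_mean(counts, prior=1.0):
    """Posterior mean of strain proportions under a symmetric Dirichlet prior.

    With a multinomial likelihood and Dirichlet(prior, ..., prior) prior,
    the posterior is Dirichlet(count_i + prior), whose mean is computed here.
    The value prior=1.0 is an assumption, not taken from the paper.
    """
    total = sum(counts.values()) + prior * len(counts)
    return {s: (c + prior) / total for s, c in counts.items()}


def knee_index(values):
    """Index of the point farthest from the chord joining the end points.

    A generic knee heuristic (assumed here); the paper's own knee-finding
    procedure may differ.
    """
    n = len(values)
    if n < 3:
        return 0
    x0, y0 = 0, values[0]
    x1, y1 = n - 1, values[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Unnormalized distance from (i, y) to the chord; the normalizing
        # constant is the same for every point, so it can be dropped.
        d = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Hypothetical toy data: each read lists the strains it aligns to.
reads = [{"A"}, {"A", "B"}, {"B", "C"}, {"A"}]
counts = fractional_counts(reads, ["A", "B", "C"])
means = posterior_mean(counts)
```

In this toy example a read that aligns to two strains adds 0.5 to each, so the fractional counts always sum to the number of reads that matched at least one strain.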
