Accurate Profiling of Microbial Communities from Massively Parallel Sequencing Using Convex Optimization

We describe the Microbial Community Reconstruction (MCR) problem, which is fundamental to microbiome analysis. The goal is to reconstruct the identity and frequency of the species comprising a microbial community from short sequence reads, obtained by Massively Parallel Sequencing (MPS) of specified genomic regions. We formulate the task as a convex optimization problem and provide sufficient conditions for identifiability, that is, for the species identities and frequencies to be reconstructed correctly as the number of reads grows to infinity. We discuss several metrics for assessing the quality of a reconstructed solution, including a novel phylogenetically aware metric based on the Mahalanobis distance, and give upper bounds on the reconstruction error for a finite number of reads under each metric. We propose a scalable divide-and-conquer algorithm based on convex optimization, which lets us handle large problems (with $\sim 10^6$ species). Numerical simulations show that in realistic scenarios, where microbial communities are sparse, our algorithm yields highly accurate solutions, both in the estimated species frequencies and in phylogenetic resolution.
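To make the convex formulation concrete, here is a minimal sketch of one way such a reconstruction could be posed: least squares over the probability simplex, solved by projected gradient descent. It is an illustration under stated assumptions, not the divide-and-conquer algorithm of the paper. The names `A` (a matrix whose column $j$ holds the expected read distribution of species $j$), `y` (the observed read frequencies) and `x` (the species frequencies to recover) are hypothetical, as is the toy data.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:  # holds for a prefix of the sorted entries
            theta = t
    return [max(vi - theta, 0.0) for vi in v]


def reconstruct(A, y, steps=2000):
    """Minimize 0.5 * ||A x - y||^2 subject to x on the simplex.

    Assumption: A[i][j] is the probability that a read from species j
    is of read type i; y[i] is the observed frequency of read type i.
    """
    m, n = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # A^T A, so 1 / lipschitz is a safe (conservative) step size.
    lipschitz = sum(a * a for row in A for a in row)
    x = [1.0 / n] * n  # start from the uniform community
    for _ in range(steps):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i]
                    for i in range(m)]
        gradient = [sum(A[i][j] * residual[i] for i in range(m))
                    for j in range(n)]
        x = project_to_simplex(
            [x[j] - gradient[j] / lipschitz for j in range(n)])
    return x


# Toy example (hypothetical data): 4 read types, 3 candidate species,
# only two of which are present in the community.
A = [[0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5]]
x_true = [0.7, 0.3, 0.0]
y = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(4)]  # noiseless reads

x_hat = reconstruct(A, y)
print([round(v, 4) for v in x_hat])
```

Because the columns of this toy `A` are linearly independent, the noiseless problem has a unique minimizer, which mirrors the identifiability condition in the infinite-read limit. With a finite number of reads, `y` would be a noisy estimate and `x_hat` would only approximate the true frequencies.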
