metaSPARSim: a 16S rRNA gene sequencing count data simulator

In the last few years, 16S rRNA gene sequencing (16S rDNA-seq) has been adopted at a surprisingly rapid rate as a methodology for microbial community studies. Despite the considerable popularity of this technique, few specific tools are currently available for proper 16S rDNA-seq count data preprocessing and simulation. Indeed, the great majority of tools have been developed by adapting methodologies previously used for bulk RNA-seq data, with poor assessment of their applicability in the metagenomics field. For such tools, and for the few specifically developed for 16S rDNA-seq data, performance assessment is challenging, mainly due to the complex nature of the data and the lack of realistic simulation models. In fact, to the best of our knowledge, no simulation software is available that directly produces synthetic 16S rDNA-seq count tables properly modelling the heavy sparsity and compositionality typical of these data. In this paper we present metaSPARSim, a sparse count matrix simulator intended for use in the development of 16S rDNA-seq metagenomic data processing pipelines. metaSPARSim implements a new generative process that models the sequencing process with a Multivariate Hypergeometric distribution in order to realistically simulate 16S rDNA-seq count tables that resemble the compositionality and sparsity of real experimental data. It provides ready-to-use count matrices and makes it possible to reproduce different pre-coded scenarios and to estimate simulation parameters from real experimental data. The tool is available at http://sysbiobig.dei.unipd.it/?q=Software#metaSPARSim and https://gitlab.com/sysbiobig/metasparsim. metaSPARSim is able to generate count matrices resembling real 16S rDNA-seq data.
The availability of count data simulators is extremely valuable both for method developers, who need a ground truth for tool validation, and for users who want to assess state-of-the-art analysis tools in order to choose the most accurate one. Thus, we believe that metaSPARSim is a valuable tool for researchers involved in developing, testing and using robust and reliable data analysis methods in the context of 16S rRNA gene sequencing.
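
To make the sampling idea concrete, the following is a minimal sketch of how a Multivariate Hypergeometric draw can turn per-taxon abundances into a count table. It is not metaSPARSim's implementation and does not reproduce its parameter estimation or pre-coded scenarios; it only illustrates the sequencing step named above as sampling reads without replacement from a finite pool of molecules. The function names, the abundance vector and the sequencing depths are hypothetical values chosen for illustration.

```python
import random
from collections import Counter


def simulate_sample(abundances, depth, rng):
    """Draw `depth` reads without replacement from a finite pool of molecules.

    Sampling without replacement from a pool grouped by taxon is exactly a
    Multivariate Hypergeometric draw over the taxa.
    """
    # Expand the pool: one entry per molecule, labelled by its taxon index.
    pool = [taxon for taxon, n in enumerate(abundances) for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sequencing depth exceeds the number of molecules")
    reads = rng.sample(pool, depth)  # without replacement
    counts = Counter(reads)
    # Taxa that were never drawn get an explicit zero count.
    return [counts.get(taxon, 0) for taxon in range(len(abundances))]


def simulate_count_table(abundances, depths, seed=0):
    """Return one row of counts per sample (samples x taxa)."""
    rng = random.Random(seed)
    return [simulate_sample(abundances, depth, rng) for depth in depths]


# Hypothetical abundances: a few dominant taxa and a long tail of rare ones.
abundances = [5000, 3000, 1500, 400, 80, 15, 4, 1]
# Hypothetical sequencing depths, one per simulated sample.
depths = [200, 500, 1000]

table = simulate_count_table(abundances, depths)
for depth, row in zip(depths, table):
    print(depth, row)
```

Because each row sums to the chosen depth, the counts carry only relative information (compositionality), and rare taxa are likely to receive zero reads at low depth (sparsity). Expanding the pool into a list is simple but memory-hungry, so a real simulator would use a dedicated Multivariate Hypergeometric sampler instead.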
