NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model

Background
The PacBio sequencing platform offers longer read lengths than second-generation sequencing technologies. It has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. Because of its wide range of applications, fast, high-fidelity sequencing simulators are in great demand to support the development and comparison of downstream analysis tools. Several simulators (e.g., PBSIM, SimLoRD and FASTQSim) already target PacBio libraries specifically, but the error rate of their simulated sequences does not match the quality values of raw PacBio datasets well, especially for PacBio's continuous long reads (CLR).

Results
By analyzing the characteristic features of CLR data from PacBio SMRT (single molecule real time) sequencing, we developed a new PacBio sequencing simulator, NPBSS, for producing CLR reads. NPBSS first samples read sequences with lengths drawn from a log-normal distribution and assigns base quality values according to their observed proportions. It then computes the overall error probability of each base in the read with an empirical model, and derives the deletion, substitution and insertion probabilities from that overall error probability to generate the PacBio CLR reads. Alignment results demonstrate that NPBSS fits the error rate of PacBio CLR reads better than PBSIM and FASTQSim. Assembly results also show that sequences simulated by NPBSS resemble real PacBio CLR data more closely.

Conclusion
NPBSS is convenient to use, computationally efficient, and offers flexible parameter settings. The PacBio CLR reads it generates resemble real PacBio datasets more closely than those of the compared simulators.
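The pipeline described in the Results can be sketched roughly as follows. This is a minimal illustration, not NPBSS itself: the log-normal parameters, the quality-value proportions, and the split of the error probability into insertions, deletions and substitutions are all hypothetical values, and the standard Phred relation stands in for the empirical model, which the abstract does not specify.

```python
import random

# Hypothetical parameters, chosen only for illustration (not taken from NPBSS).
LOG_MEAN, LOG_SIGMA = 8.5, 0.6                 # log-normal read-length parameters
QV_CHOICES = [5, 8, 10, 12, 15]                # candidate base quality values
QV_WEIGHTS = [0.10, 0.20, 0.30, 0.25, 0.15]    # proportion of each quality value
ERROR_SPLIT = {"ins": 0.6, "del": 0.3, "sub": 0.1}  # shares of the total error


def phred_error(qv):
    """Placeholder for the empirical model: the standard Phred relation."""
    return 10 ** (-qv / 10)


def simulate_read(reference, rng):
    """Return (read, qualities) for one simulated CLR-like read."""
    # 1. Sample the read length from a log-normal distribution,
    #    clamped to the range [1, len(reference)].
    length = int(rng.lognormvariate(LOG_MEAN, LOG_SIGMA))
    length = min(len(reference), max(1, length))
    start = rng.randrange(len(reference) - length + 1)
    template = reference[start:start + length]

    bases, quals = [], []
    for base in template:
        # 2. Choose a quality value according to the given proportions.
        qv = rng.choices(QV_CHOICES, weights=QV_WEIGHTS)[0]
        # 3. Overall error probability of this base.
        p = phred_error(qv)
        # 4. Split the overall error into deletion, substitution, insertion.
        r = rng.random()
        if r < p * ERROR_SPLIT["del"]:
            continue  # deletion: the base is dropped
        elif r < p * (ERROR_SPLIT["del"] + ERROR_SPLIT["sub"]):
            base = rng.choice([b for b in "ACGT" if b != base])  # substitution
        elif r < p:
            # insertion: emit a random extra base before the true base
            bases.append(rng.choice("ACGT"))
            quals.append(qv)
        bases.append(base)
        quals.append(qv)
    return "".join(bases), quals


if __name__ == "__main__":
    rng = random.Random(0)
    reference = "".join(rng.choice("ACGT") for _ in range(50_000))
    read, quals = simulate_read(reference, rng)
    print(len(read), read[:60])
```

In a real simulator the `phred_error` placeholder would be replaced by a model fitted to observed CLR alignments, which is the step the abstract identifies as the main difference from earlier tools.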
