Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses

The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. It is therefore of utmost importance that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influence of errors (such as false-positive SNPs, ambiguously called sites, and gaps) is mitigated. The choices researchers make when generating and analyzing WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices on our ability to detect true SNPs and on the frequencies of errors: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming. In benchmarking experiments, we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest number of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should test a variety of conditions to achieve optimal results.
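To make the error categories named above concrete, the following is a minimal sketch of how sites in a consensus sequence might be tallied against a known truth. It is not the authors' pipeline: the function name `classify_sites`, the category labels, and the assumption that the reference, the true genome, and the called consensus are already aligned column by column as equal-length strings are all illustrative choices.

```python
from collections import Counter

BASES = set("ACGT")


def classify_sites(reference, truth, called):
    """Tally per-site outcomes for a called consensus (hypothetical helper).

    Assumes the three sequences are pre-aligned and of equal length:
    reference -- the reference genome the reads were aligned to
    truth     -- the known true sequence of the sequenced strain
    called    -- the consensus produced from the short-read alignment
    """
    if not (len(reference) == len(truth) == len(called)):
        raise ValueError("sequences must be aligned to the same length")

    counts = Counter()
    for ref, true, call in zip(reference.upper(), truth.upper(), called.upper()):
        if call == "-":
            counts["gap"] += 1                 # no base called at this site
        elif call not in BASES:
            counts["ambiguous"] += 1           # N or another ambiguity code
        elif call == true:
            # Correct call: a true SNP only if the truth differs from the reference.
            counts["true_snp" if true != ref else "match"] += 1
        elif call == ref:
            counts["missed_snp"] += 1          # real difference left undetected
        else:
            counts["false_positive_snp"] += 1  # called base agrees with neither
    return counts


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    truth = "ACGTTCGTAA"   # differs from the reference at two sites
    called = "ATGTTCNT-C"  # toy consensus containing each kind of outcome
    for category, n in sorted(classify_sites(reference, truth, called).items()):
        print(category, n)
```

Tallies of this kind could then be compared across assemblers, reference genomes, coverage depths, and pre-processing settings, which is the style of comparison the benchmarking experiments describe.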
