SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes

De novo assembly of bacterial genomes from next-generation sequencing (NGS) data allows reference-free discovery of single nucleotide polymorphisms (SNPs). However, the substantial error rates in genomes assembled this way remain a major barrier to reference-free analysis of genome variation in medically important bacteria. The aim of this report was to improve the quality of SNP identification in bacterial genomes that lack closely related references. We developed a bioinformatics pipeline (SnpFilt) that constructs an assembly using SPAdes and then removes unreliable regions based on the quality and coverage of re-aligned reads in neighbouring regions. The performance of the pipeline was compared against reference-based SNP calling for Illumina HiSeq, MiSeq and NextSeq reads from a range of bacterial pathogens, including Salmonella, one of the most common causes of food-borne disease. The SnpFilt pipeline removed all false SNPs in every test NGS dataset consisting of paired-end Illumina reads. We also showed that reliable and complete SNP calls require at least 40-fold coverage. Analysis of bacterial isolates associated with epidemiologically confirmed outbreaks using the SnpFilt pipeline produced results consistent with previously published findings. The SnpFilt pipeline improves the quality of de novo assembly and the precision of SNP calling in bacterial genomes by removing regions of the assembly that may contain assembly errors. SnpFilt is available from https://github.com/LanLab/SnpFilt.
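The filtering idea described above (discarding assembly regions where re-aligned reads show poor quality or coverage nearby) can be illustrated with a minimal sketch. This is not the SnpFilt implementation: the function names, the thresholds `min_cov` and `min_qual`, and the `flank` window size are all hypothetical and chosen only for illustration. It assumes per-base coverage and mean base quality have already been computed for a contig from the re-aligned reads.

```python
from typing import List, Sequence


def unreliable_mask(coverage: Sequence[int],
                    quality: Sequence[float],
                    min_cov: int = 10,       # hypothetical threshold
                    min_qual: float = 20.0,  # hypothetical threshold
                    flank: int = 2) -> List[bool]:
    """Flag positions that are weakly supported or lie within
    `flank` bases of a weakly supported position."""
    n = len(coverage)
    # A position is weak if its coverage or mean quality is below threshold.
    weak = [c < min_cov or q < min_qual for c, q in zip(coverage, quality)]
    mask = [False] * n
    for i, is_weak in enumerate(weak):
        if is_weak:
            # Extend the mask to the neighbouring positions as well.
            for j in range(max(0, i - flank), min(n, i + flank + 1)):
                mask[j] = True
    return mask


def filter_snps(snp_positions: Sequence[int], mask: Sequence[bool]) -> List[int]:
    """Keep only SNP positions (0-based) outside masked regions."""
    return [p for p in snp_positions if not mask[p]]


if __name__ == "__main__":
    cov = [30, 30, 30, 2, 30, 30, 30, 30]
    qual = [35.0] * 8
    mask = unreliable_mask(cov, qual, flank=1)
    print(filter_snps([1, 3, 4, 6], mask))
```

In this toy example, the coverage drop at position 3 masks positions 2 to 4, so candidate SNPs at positions 3 and 4 are discarded while those at 1 and 6 are kept. The 40-fold figure in the abstract refers to overall sequencing depth, not to the per-base threshold used in this sketch.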
