CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data

The analysis of next-generation sequence (NGS) data is often a fragmented, stepwise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. Managing and chaining these software pieces and their outputs is often cumbersome and difficult. Here, we present CFSAN SNP Pipeline, which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of the resulting mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that, when given a reference genome, generates variants at known positions, against which we validate our pipeline. We created 1,000 simulated Salmonella enterica subsp. enterica serovar Agona genomes at 100× and 20× coverage, each containing 500 SNPs, 20 single-base insertions, and 20 single-base deletions. For the 100× dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10⁻⁶; for the 20× dataset, 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10⁻⁷. Based on these results, CFSAN SNP Pipeline is a robust and accurate tool, and it is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as to those interested in evolutionary questions.
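The validation idea described above (introduce variants at known positions, then check how many a caller recovers) can be illustrated with a small Python sketch. This is not the CFSAN SNP Mutator API. The function names `mutate_snps` and `score_calls` are hypothetical, only substitutions are simulated (no insertions or deletions), and the false positive rate is assumed here to be the number of spurious calls divided by the number of unmutated reference positions, a definition the abstract does not state.

```python
import random

BASES = "ACGT"


def mutate_snps(reference, n_snps, seed=0):
    """Hypothetical helper: substitute n_snps bases at random, distinct positions.

    Returns the mutated sequence and a dict {position: new_base} recording
    the ground truth, so that variant calls can be scored against it later.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(reference)), n_snps)
    seq = list(reference)
    truth = {}
    for pos in positions:
        # Pick a base that differs from the reference base at this position.
        alternatives = [b for b in BASES if b != seq[pos]]
        seq[pos] = rng.choice(alternatives)
        truth[pos] = seq[pos]
    return "".join(seq), truth


def score_calls(truth_positions, called_positions, genome_length):
    """Hypothetical helper: recovery rate and false positive rate.

    Assumption: the false positive rate is spurious calls divided by the
    number of positions that were NOT mutated.
    """
    truth = set(truth_positions)
    called = set(called_positions)
    true_pos = len(truth & called)
    false_pos = len(called - truth)
    recall = true_pos / len(truth)
    fpr = false_pos / (genome_length - len(truth))
    return recall, fpr


# Toy data: a random 10 kb "genome" with 50 introduced SNPs.
reference = "".join(random.Random(1).choice(BASES) for _ in range(10_000))
mutated, truth = mutate_snps(reference, n_snps=50, seed=2)

# Stand-in for a real variant caller: it misses one true SNP and reports
# one spurious site, so both metrics are non-trivial.
spurious = next(p for p in range(len(reference)) if p not in truth)
called = sorted(truth)[1:] + [spurious]

recall, fpr = score_calls(truth, called, len(reference))
print(f"recall={recall:.3f} false_positive_rate={fpr:.2e}")
```

In a real evaluation the `called` positions would come from running the pipeline on reads simulated from the mutated genome, not from a hand-built list as in this toy example.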
