Paper information - CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data

CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data

The analysis of next-generation sequence (NGS) data is often a fragmented, stepwise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. Managing and chaining these pieces of software and their outputs can be cumbersome and difficult. Here, we present CFSAN SNP Pipeline, which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of the resulting mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that, when given a reference genome, generates variants at known positions, against which we validate our pipeline. We created 1,000 simulated Salmonella enterica subsp. enterica serovar Agona genomes at 100× and 20× coverage, each containing 500 SNPs, 20 single-base insertions, and 20 single-base deletions. For the 100× dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10⁻⁶; for the 20× dataset, 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10⁻⁷. Based on these results, CFSAN SNP Pipeline is a robust and accurate tool, and it is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as to those interested in evolutionary questions.
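To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the pipeline's actual code. It shows how a SNP matrix can be assembled from per-sample variant calls, and how a recovery rate and a false positive rate can be computed when the introduced SNP positions are known. The function names `build_snp_matrix` and `recovery_and_fpr`, the toy data, the rule that a sample without a call carries the reference base, and the false positive rate denominator (all non-mutated sites) are illustrative assumptions; the abstract does not specify them.

```python
from typing import Dict, List, Set, Tuple


def build_snp_matrix(reference: str, calls: Dict[str, Dict[int, str]]) -> Dict[str, str]:
    """Return one string per sample holding the base at every variable site.

    `calls` maps sample name -> {0-based position: called base} (hypothetical layout).
    """
    # Union of all positions at which any sample has a called SNP.
    sites: List[int] = sorted({pos for sample in calls.values() for pos in sample})
    matrix = {}
    for name, sample in calls.items():
        # Assumption: a sample with no call at a site carries the reference base.
        matrix[name] = "".join(sample.get(pos, reference[pos]) for pos in sites)
    return matrix


def recovery_and_fpr(introduced: Set[int], called: Set[int],
                     genome_length: int) -> Tuple[float, float]:
    """Compare called SNP positions with the positions that were introduced."""
    true_pos = len(introduced & called)
    false_pos = len(called - introduced)
    recovery = true_pos / len(introduced)
    # Assumed definition: false positives divided by all non-mutated sites.
    fpr = false_pos / (genome_length - len(introduced))
    return recovery, fpr


# Toy data: a 10-base reference and three samples.
reference = "ACGTACGTAC"
calls = {
    "sample1": {1: "T", 6: "A"},
    "sample2": {6: "A"},
    "sample3": {8: "G"},
}
matrix = build_snp_matrix(reference, calls)
for name, row in matrix.items():
    print(name, row)

recovery, fpr = recovery_and_fpr({1, 6, 8}, {1, 6, 9}, len(reference))
print(f"recovery={recovery:.3f} false_positive_rate={fpr:.3f}")
```

The matrix keeps only the columns where at least one sample differs from the reference, which is what distinguishes a SNP matrix from a full alignment. A production pipeline would also need to handle positions with missing or low-quality data, which this sketch ignores.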
