Scalable SNP Analyses of 100+ Bacterial or Viral Genomes

With the flood of finished and draft whole-genome microbial sequences, analysts need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole-genome data. It scales to hundreds of bacterial or viral genomes and can be used for finished genomes, draft genomes available as unassembled contigs, or both. The method is fast to compute, finding SNPs and building a SNP phylogeny in seconds to hours. It identified thousands of putative SNPs from all publicly available Filoviridae, Poxviridae, foot-and-mouth disease virus, Bacillus, and Escherichia coli genomes and plasmids. The SNP-based trees it generated were consistent with known taxonomy and with trees determined in other studies. The approach can handle hundreds of megabases of input sequence in a single run. The algorithm, kSNP, is based on k-mer analysis using suffix arrays and requires no multiple sequence alignment. Genome-wide SNP analysis of this kind may uncover novel regions that correlate with phenotype outside of well-characterized genes or non-coding sequence. It should also be useful in horizontal gene transfer studies, since one can examine SNPs across the entire genome. Although beyond the scope of this paper, microarrays with probes designed for all putative SNPs can be used to experimentally validate SNP alleles, identify sequencing errors, and characterize SNP alleles in unsequenced isolates to place them on a phylogeny (manuscript in preparation).
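
The idea of finding SNPs from k-mers without a multiple sequence alignment can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: k is odd, a candidate SNP is a k-mer whose flanking bases match across genomes while the central base differs, and a plain dictionary stands in for the suffix arrays. The function name kmer_snps and its parameters are hypothetical, and reverse-complement strands are ignored for brevity. It is not the kSNP implementation.

```python
from collections import defaultdict


def kmer_snps(genomes, k=5):
    """Return candidate SNPs as {(left_flank, right_flank): {genome: central_base}}.

    genomes: dict mapping a genome name to its sequence (contigs could be
    processed one by one in the same way).
    Assumption for illustration: a SNP is a k-mer context whose flanks are
    shared by at least two genomes but whose central base differs.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that each k-mer has a central base")
    half = k // 2

    # flank context -> genome name -> set of central bases seen in that genome
    contexts = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip k-mers containing ambiguous bases such as N
            flank = (kmer[:half], kmer[half + 1:])
            contexts[flank][name].add(kmer[half])

    snps = {}
    for flank, per_genome in contexts.items():
        # Drop contexts that occur with more than one central base inside a
        # single genome; they cannot be assigned one allele per genome.
        if any(len(bases) > 1 for bases in per_genome.values()):
            continue
        alleles = {name: next(iter(bases)) for name, bases in per_genome.items()}
        if len(set(alleles.values())) > 1:
            snps[flank] = alleles
    return snps


if __name__ == "__main__":
    toy = {"g1": "AACGTTGCA", "g2": "AACGATGCA"}
    print(kmer_snps(toy, k=5))
    # {('CG', 'TG'): {'g1': 'T', 'g2': 'A'}}
```

Because each genome is scanned independently and only k-mer contexts are compared, no alignment step is needed. A real tool would also need to handle both strands and choose k large enough for contexts to be unique in each genome.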
