StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees

Background: Fast, accurate and high-throughput identification of bacterial isolates is in great demand. The present work was conducted to investigate the possibility of identifying isolates from unassembled next-generation sequencing reads using custom-made guide trees.

Results: A tool named StrainSeeker was developed that constructs a list of specific k-mers for each node of any given Newick-format tree and enables the identification of bacterial isolates in 1–2 min. It uses a novel algorithm that analyses the observed and expected fractions of node-specific k-mers to test for the presence of each node in the sample. This allows StrainSeeker to determine where the isolate branches off the guide tree and to assign it to a clade, whereas other tools assign each read to a reference genome. Using a dataset of 100 Escherichia coli isolates, we demonstrate that StrainSeeker predicts the clade of E. coli isolates with 92% accuracy and assigns the correct tree branch with 98% accuracy. Twenty-five thousand Illumina HiSeq reads are sufficient to identify the strain.

Conclusion: StrainSeeker is a software program that identifies bacterial isolates by assigning them to nodes or leaves of a custom-made guide tree. StrainSeeker's web interface and pre-computed guide trees are available at http://bioinfo.ut.ee/strainseeker. Source code is stored at GitHub: https://github.com/bioinfo-ut/StrainSeeker.
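The idea of node-specific k-mers and guide-tree placement can be illustrated with a small sketch. This is not StrainSeeker's implementation: the guide tree is written as nested Python tuples rather than parsed from Newick, a k-mer is assumed to be "node-specific" when it occurs in every genome under the node and in no genome outside it, and a fixed cut-off (`min_fraction`, a hypothetical parameter) stands in for the comparison of observed and expected fractions described in the abstract. All function names are illustrative.

```python
from typing import Dict, FrozenSet, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def leaves(tree) -> FrozenSet[str]:
    """Leaf names under a node; a node is a leaf name or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def node_specific_kmers(tree, genomes: Dict[str, str], k: int):
    """Map each node (keyed by its leaf set) to its specific k-mers.

    Assumption: a k-mer is specific to a node if it is present in every
    genome under that node and absent from every genome outside it.
    """
    all_leaves = leaves(tree)
    leaf_kmers = {name: kmers(genomes[name], k) for name in all_leaves}
    result = {}

    def visit(node):
        inside = leaves(node)
        shared = set.intersection(*(leaf_kmers[n] for n in inside))
        for n in all_leaves - inside:
            shared -= leaf_kmers[n]
        result[inside] = shared
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return result


def place_sample(tree, specific, sample_kmers: Set[str],
                 min_fraction: float = 0.5) -> FrozenSet[str]:
    """Walk down from the root while some child node looks present.

    A child counts as present when the observed fraction of its specific
    k-mers in the sample reaches min_fraction (a simplification of the
    observed-versus-expected test). Returns the leaf set of the deepest
    supported node: one leaf for a known strain, several for a clade.
    """
    node = tree
    while not isinstance(node, str):
        best, best_frac = None, 0.0
        for child in node:
            spec = specific[leaves(child)]
            frac = len(spec & sample_kmers) / len(spec) if spec else 0.0
            if frac > best_frac:
                best, best_frac = child, frac
        if best is None or best_frac < min_fraction:
            break  # the sample branches off the guide tree at this node
        node = best
    return leaves(node)
```

With a toy tree `(("A", "B"), "C")`, a sample that carries the k-mers shared by A and B but none of the k-mers unique to either would stop at the internal node and be reported as belonging to the clade {A, B}, which mirrors the clade-level assignment described above.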
