The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes

Background. As whole genome sequence data from bacterial isolates become cheaper to generate, computational methods are needed to correlate sequence data with biological observations. Here we present the large-scale BLAST score ratio (LS-BSR) pipeline, which rapidly compares the genetic content of hundreds to thousands of bacterial genomes and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed. This matrix can be easily parsed to identify genetic relationships between bacterial genomes. Although pipelines have been published that group peptides by sequence similarity, no other software performs the rapid, large-scale, full-genome comparative analyses carried out by LS-BSR.

Results. To demonstrate the utility of the method, the LS-BSR pipeline was tested on 96 Escherichia coli and Shigella genomes; the pipeline ran in 163 min using 16 processors, a greater than 7-fold speedup compared to using a single processor. The BSR values for each CDS, which indicate a relative level of relatedness, were then mapped to each genome on an independent core genome single nucleotide polymorphism (SNP) based phylogeny. Comparisons were then used to identify clade-specific CDS markers and to validate the LS-BSR pipeline against molecular markers that delineate classical E. coli pathogenic variant (pathovar) designations. Scalability tests demonstrated that the LS-BSR pipeline can process 1,000 E. coli genomes in 27–57 h, depending upon the alignment method, using 16 processors.

Conclusions. LS-BSR is an open-source, parallel implementation of the BSR algorithm, enabling rapid comparison of the genetic content of large numbers of genomes. The results of the pipeline can be used to identify specific markers between user-defined phylogenetic groups, and to identify the loss and/or acquisition of genetic information between bacterial isolates. Taxa-specific genetic markers can then be translated into clinical diagnostics, or can be used to identify broadly conserved putative therapeutic candidates.
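As an illustration of how a BSR matrix might be parsed for group-specific markers, the following minimal Python sketch computes a BLAST score ratio as the bit score of a CDS aligned against a genome divided by the bit score of that CDS aligned against itself, then selects CDSs that are conserved in one user-defined group and absent elsewhere. It is not the LS-BSR code: the function names, the toy bit scores, the genome labels, and the 0.8 and 0.4 thresholds are all assumptions chosen for the example.

```python
def blast_score_ratio(query_bits, self_bits):
    """Return the BSR: query bit score divided by the self-alignment bit score."""
    if self_bits <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return query_bits / self_bits


def group_specific_cds(matrix, group, present=0.8, absent=0.4):
    """Return CDS IDs with BSR >= present in every genome of `group`
    and BSR < absent in every genome outside it.

    `present` and `absent` are illustrative thresholds (assumptions).
    """
    markers = []
    for cds, row in matrix.items():
        in_group = all(row[g] >= present for g in group)
        out_group = all(v < absent for g, v in row.items() if g not in group)
        if in_group and out_group:
            markers.append(cds)
    return sorted(markers)


# Hypothetical bit scores: each CDS against itself and against three genomes.
self_bits = {"cds_1": 500.0, "cds_2": 300.0, "cds_3": 400.0}
query_bits = {
    "cds_1": {"genome_A": 500.0, "genome_B": 490.0, "genome_C": 480.0},
    "cds_2": {"genome_A": 300.0, "genome_B": 285.0, "genome_C": 60.0},
    "cds_3": {"genome_A": 0.0, "genome_B": 120.0, "genome_C": 400.0},
}

# Build the CDS-by-genome BSR matrix.
matrix = {
    cds: {g: blast_score_ratio(bits, self_bits[cds]) for g, bits in hits.items()}
    for cds, hits in query_bits.items()
}

print(group_specific_cds(matrix, {"genome_A", "genome_B"}))
```

In this toy matrix, cds_1 is conserved in all three genomes and so is not a marker for any group, while cds_2 is retained as a candidate marker for the group containing genome_A and genome_B.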
