BPGA: an ultra-fast pan-genome analysis pipeline

Recent advances in ultra-high-throughput sequencing technology and metagenomics have led to a paradigm shift in microbial genomics, from comparisons of a few genomes to large-scale pan-genome studies at different scales of phylogenetic resolution. Pan-genome studies provide a framework for estimating the genomic diversity of a dataset, determining the core (conserved), accessory (dispensable) and unique (strain-specific) gene pools of a species, tracing horizontal gene flux across strains and gaining insight into species evolution. Existing pan-genome software tools suffer from various limitations, such as restricted dataset size, difficult installation or demanding requirements, and inadequate functional features. Here we present BPGA (Bacterial Pan Genome Analysis tool), an ultra-fast computational pipeline with seven functional modules. In addition to routine pan-genome analyses, BPGA introduces a number of novel features for downstream analyses, such as core/pan/MLST (Multi Locus Sequence Typing) phylogeny, exclusive presence/absence of genes in specific strains, subset analysis, atypical G+C content analysis and KEGG and COG mapping of core, accessory and unique genes. Other notable features include minimal running prerequisites, freedom to select the gene clustering method, ultra-fast execution, a user-friendly command-line interface and high-quality graphical outputs. The performance of BPGA has been evaluated using a dataset of complete genome sequences of 28 Streptococcus pyogenes strains.
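The partition into core, accessory and unique gene pools mentioned above can be illustrated with a minimal Python sketch. This is not BPGA's implementation: the function name `classify_clusters`, the input layout (a mapping from gene-cluster ID to the set of strains carrying it) and the toy data are assumptions made purely for illustration. It also assumes that clustering has already been done by some external tool.

```python
from typing import Dict, Set


def classify_clusters(presence: Dict[str, Set[str]], n_genomes: int) -> Dict[str, str]:
    """Label each gene cluster as 'core', 'accessory' or 'unique'.

    presence  : hypothetical mapping of cluster ID -> set of strain names
                in which at least one member gene was found.
    n_genomes : total number of strains in the dataset.
    """
    labels: Dict[str, str] = {}
    for cluster, strains in presence.items():
        count = len(strains)
        if count == 0 or count > n_genomes:
            raise ValueError(f"cluster {cluster!r} has an invalid strain count: {count}")
        if count == n_genomes:
            labels[cluster] = "core"        # present in every strain
        elif count == 1:
            labels[cluster] = "unique"      # present in exactly one strain
        else:
            labels[cluster] = "accessory"   # present in some, but not all, strains
    return labels


# Toy dataset with three hypothetical strains A, B and C.
presence = {
    "cluster_1": {"A", "B", "C"},
    "cluster_2": {"A", "B"},
    "cluster_3": {"C"},
}
labels = classify_clusters(presence, n_genomes=3)
pan_genome_size = len(presence)  # the pan-genome is the union of all clusters
core_genome_size = sum(1 for v in labels.values() if v == "core")
```

In this toy case the pan-genome contains three clusters, of which one is core, one accessory and one unique. A real analysis would also have to decide how to treat paralogs and fragmented genes, which this sketch ignores.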
