Hierarchical sets: analyzing pangenome structure through scalable set visualizations

Motivation: The increase in available microbial genome sequences has led to an increase in the size of the pangenomes being analyzed. Current pangenome visualizations were not designed for the pangenome sizes possible today, and new approaches are needed to convert the increase in available information into an increase in knowledge. As the pangenome data structure is essentially a collection of sets, we explore the potential of scalable set visualization as a tool for pangenome analysis.

Results: We present a new hierarchical clustering algorithm based on set arithmetic that optimizes the intersection sizes along the branches. The intersection and union sizes along the hierarchy are visualized using a composite dendrogram and icicle plot, which, in a pangenome context, shows the evolution of pangenome and core size along the evolutionary hierarchy. Outlying elements, i.e. elements whose presence patterns do not correspond with the hierarchy, can be visualized using hierarchical edge bundles. When applied to pangenome data, this plot shows putative horizontal gene transfers between the genomes and can highlight relationships between genomes that are not represented by the hierarchy. We illustrate the utility of hierarchical sets by applying the approach to a pangenome based on 113 Escherichia and Shigella genomes and find that it provides a powerful addition to pangenome analysis.

Availability and Implementation: The described clustering algorithm and visualizations are implemented in the hierarchicalSets R package available from CRAN (https://cran.r-project.org/web/packages/hierarchicalSets).

Contact: thomasp85@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
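To make the clustering idea in the Results concrete, the following is a minimal Python sketch, not the hierarchicalSets implementation. It assumes a greedy agglomerative scheme in which the pair of clusters with the largest intersection-to-union size ratio is merged at each step; the criterion used in the paper may differ. The function name `cluster_sets` and the toy genomes are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def cluster_sets(sets):
    """Greedily cluster named sets (hypothetical helper, illustration only).

    Each cluster tracks the intersection ("core") and union ("pan") of its
    members. At every step the pair of clusters whose merged
    intersection/union size ratio is largest is joined. This ratio is an
    assumed criterion, not necessarily the one used by the paper.
    Returns a list of merges as (members, core_size, pan_size).
    """
    # Each cluster is (member names, intersection, union).
    clusters = [((name,), frozenset(s), frozenset(s))
                for name, s in sorted(sets.items())]
    merges = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            inter = clusters[i][1] & clusters[j][1]
            union = clusters[i][2] | clusters[j][2]
            score = len(inter) / len(union) if union else 0.0
            if best is None or score > best[0]:
                best = (score, i, j, inter, union)
        _, i, j, inter, union = best
        members = tuple(sorted(clusters[i][0] + clusters[j][0]))
        # Remove the higher index first so the lower index stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append((members, inter, union))
        merges.append((members, len(inter), len(union)))
    return merges


# Toy "genomes" as sets of gene-family identifiers (made-up data).
genomes = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {1, 2, 6}}
for members, core_size, pan_size in cluster_sets(genomes):
    print(members, "core:", core_size, "pan:", pan_size)
```

The core and pan sizes recorded at each merge correspond to the intersection and union sizes that the dendrogram and icicle plot display along the hierarchy. This sketch scans all pairs at every step and is meant for reading, not for pangenomes of realistic size.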
