Pan-Genome Storage and Analysis Techniques

Computational pan-genome analysis emerged in response to the rapid growth of available genome sequencing data. Originally introduced for microbial genomes, the pan-genome concept has since been applied to a variety of other organisms, such as plants and viruses. Characterizing a pan-genome provides insights into intra-species evolution, function, and diversity. However, researchers face challenges such as processing and maintaining large datasets while still providing accurate and efficient analysis approaches. Comparative genomics methods are required to detect conserved and unique regions across a set of genomes. This chapter gives an overview of tools available for indexing pan-genomes, identifying the sub-regions of a pan-genome, and offering a variety of downstream analysis methods. These tools are categorized into two groups, gene-based and sequence-based, according to the pan-genome identification method. We highlight the differences between the tools, along with their advantages and disadvantages, and provide information about their general workflow, methodology of pan-genome identification, covered functionalities, usability, and availability.
