AGC: compact representation of assembled genomes with fast queries and updates

Abstract

Motivation: A high-quality sequence assembly is the ultimate representation of the complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies occupying hundreds of gigabytes on disk, which greatly impedes the distribution of and access to such rich datasets.

Results: Here, we show how to reduce the size of the sequenced genomes by 2–3 orders of magnitude. Our tool, AGC, compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or part of one) in a fraction of a second and to easily append new samples to a compressed collection. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. As the cost of sequencing falls rapidly and its accuracy improves, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundational tool to store, distribute and access pangenome data.

Availability and implementation: The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc.

Supplementary information: Supplementary data are available at Bioinformatics online.
