Allowing mutations in maximal matches boosts genome compression performance

MOTIVATION A maximal match between two genomes is a contiguous, non-extendable subsequence common to both genomes. DNA bases frequently differ between the genomes of two individuals. When a mutation falls inside what would otherwise be a maximal match, it breaks that match into shorter segments. Encoding these broken segments in reference-based genome compression costs much more than encoding a single maximal match that is allowed to contain mutations.

RESULTS We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme. It then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments show that memRGC improves compression performance by an average of 27% in reference-based genome compression. MemRGC also outperforms the best state-of-the-art methods on all of the benchmark datasets, in some cases by 50%. Moreover, memRGC uses much less memory and far fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission.

AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/memRGC.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
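The idea of extending an exact match across a mutation can be illustrated with a minimal Python sketch. This is not the memRGC implementation: the function names, the single-base substitution model, and the `min_next` threshold are all assumptions made for illustration, and the k-mer sampling step that finds the starting positions is omitted.

```python
def extend_exact(ref, tgt, r, t):
    """Return the length of the exact match starting at ref[r] and tgt[t]."""
    n = 0
    while r + n < len(ref) and t + n < len(tgt) and ref[r + n] == tgt[t + n]:
        n += 1
    return n


def extend_with_mutations(ref, tgt, r, t, min_next=3):
    """Grow an exact match at (r, t) into a mutation-containing match.

    After each exact run, one mismatching base is absorbed only if at
    least `min_next` matching bases follow it (hypothetical threshold).
    Returns (length, mismatches), where mismatches is a list of
    (offset_in_match, target_base) pairs.
    """
    length = extend_exact(ref, tgt, r, t)
    mismatches = []
    while length > 0:
        # Positions just past the candidate mismatch in both sequences.
        nr, nt = r + length + 1, t + length + 1
        if nr > len(ref) or nt > len(tgt):
            break  # no base left to treat as a mismatch
        nxt = extend_exact(ref, tgt, nr, nt)
        if nxt < min_next:
            break  # the neighbouring match is too short to be worth joining
        mismatches.append((length, tgt[t + length]))
        length += 1 + nxt
    return length, mismatches


ref = "ACGTACGTTTGACCA"
tgt = "ACGTAGGTTTGACCA"  # differs from ref at offset 5 (C -> G)
print(extend_exact(ref, tgt, 0, 0))           # 5
print(extend_with_mutations(ref, tgt, 0, 0))  # (15, [(5, 'G')])
```

In this toy example the exact match stops after 5 bases, so a match-only encoder would need two separate match records. The extended version covers all 15 bases with one record plus a single substituted base, which is the kind of saving the abstract attributes to MCMs.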
