MAFCO: A Compression Tool for MAF Files

Over the last decade, the cost of genomic sequencing has dropped so sharply that researchers all over the world are accumulating huge amounts of data for present and future use. These genomic data need to be stored efficiently, because storage costs are not decreasing as fast as sequencing costs. To cope with this problem, general-purpose compression tools, most commonly gzip, are usually used. However, these tools were not specifically designed to compress this kind of data, and they often fall short when the goal is to reduce the data size as much as possible. Several compression algorithms are available, including some for genomic data, but very few have been designed to handle Whole Genome Alignments, which contain alignments between the entire genomes of several species. In this paper, we present MAFCO, a lossless compression tool specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain of 34% to 57%, depending on the data set. Compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source code and binaries for several operating systems are freely available for non-commercial use at http://bioinformatics.ua.pt/software/mafco.
