Compression of Amino Acid Sequences

Amino acid sequences are known to be very hard to compress. In this paper, we propose AC, a lossless compressor for the efficient compression of amino acid sequences. The compressor combines multiple context models with substitutional tolerant context models. The contribution of each model is balanced with weights that favor the models with better performance, according to a forgetting function specific to each model. AC consistently achieves better compression results than other approaches while using low computational resources. The implementation is freely available, under the GPLv3 license, at https://github.com/pratas/ac.
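The weighted cooperation between models described above can be illustrated with a minimal sketch. It is not the AC implementation: the class and function names, the additive estimator, the default parameter values, and the weight update rule (each weight raised to a per-model forgetting factor `gamma` and multiplied by the probability the model assigned to the observed symbol) are all assumptions made for illustration. Substitutional tolerant models are omitted for brevity, and the function only estimates the code length in bits rather than producing a compressed stream.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class ContextModel:
    """Order-k finite-context model with an additive (alpha) estimator.

    Hypothetical helper for illustration; not taken from the AC source code.
    """

    def __init__(self, order, alpha=1.0, gamma=0.9):
        self.order = order
        self.alpha = alpha
        self.gamma = gamma  # assumed per-model forgetting factor
        self.counts = defaultdict(dict)

    def _context(self, history):
        # history[-0:] would return the whole string, so handle order 0 explicitly.
        return history[-self.order:] if self.order > 0 else ""

    def probabilities(self, history):
        seen = self.counts.get(self._context(history), {})
        total = sum(seen.values()) + self.alpha * len(ALPHABET)
        return {s: (seen.get(s, 0) + self.alpha) / total for s in ALPHABET}

    def update(self, history, symbol):
        ctx_counts = self.counts[self._context(history)]
        ctx_counts[symbol] = ctx_counts.get(symbol, 0) + 1


def estimate_bits(sequence, models):
    """Estimate the code length (in bits) of `sequence` under a weighted mixture."""
    weights = [1.0 / len(models)] * len(models)
    max_order = max(m.order for m in models)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        dists = [m.probabilities(history) for m in models]
        # Mixture probability of the symbol that actually occurred.
        p_mix = sum(w * d[symbol] for w, d in zip(weights, dists))
        total_bits += -math.log2(p_mix)
        # Assumed update: forget part of the old weight, reward good predictions.
        raw = [(w ** m.gamma) * d[symbol] for w, m, d in zip(weights, models, dists)]
        norm = sum(raw)
        weights = [r / norm for r in raw]
        for m in models:
            m.update(history, symbol)
    return total_bits
```

For example, `estimate_bits(seq, [ContextModel(1), ContextModel(3)])` mixes an order-1 and an order-3 model. On a highly repetitive input the estimate should fall well below the `len(seq) * log2(20)` bits needed by a uniform code, whereas real protein sequences typically leave far less room for gain.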
