DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database

Modern biological science produces vast amounts of genomic sequence data, which fuels the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques from information theory are usually perceived as being of interest for data communication and storage. In recent years, however, substantial effort has gone into applying textual data compression techniques to computational biology tasks, ranging from storage and indexing of large datasets to comparison of genomic databases. This paper presents a differential compression algorithm that produces difference sequences according to an op-code table in order to optimize the compression of homologous sequences in a dataset. Instead of storing each sequence individually, the stored data consist of a reference sequence, the set of differences, and the locations of those differences. The algorithm does not require a priori knowledge of the statistics of the sequence set. Applied to three different datasets of genomic sequences, it achieved a compression ratio of up to 195-fold, corresponding to 99.4% space saving.
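The storage scheme described above (one reference sequence plus, for each other sequence, its differences and their locations) can be illustrated with a minimal Python sketch. The paper's own op-code table is not given in this abstract, so the sketch substitutes the edit opcodes (`replace`, `insert`, `delete`) produced by the standard library's `difflib.SequenceMatcher`; the function names `encode_diff` and `decode_diff` and the example sequences are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def encode_diff(reference, target):
    """Return the differences between target and reference.

    Each entry is (tag, start, end, payload): the edit type, the span
    [start, end) in the reference that it affects, and the bases from
    the target that take its place. The tags come from difflib, which
    is an assumption standing in for the paper's op-code table.
    """
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # matching regions are implied by the reference
            diffs.append((tag, i1, i2, target[j1:j2]))
    return diffs


def decode_diff(reference, diffs):
    """Rebuild the original sequence from the reference and its differences."""
    parts = []
    pos = 0
    for _tag, i1, i2, payload in diffs:
        parts.append(reference[pos:i1])  # unchanged stretch of the reference
        parts.append(payload)            # bases that replace reference[i1:i2]
        pos = i2
    parts.append(reference[pos:])        # unchanged tail
    return "".join(parts)


# Hypothetical homologous sequences: one substitution and one insertion.
reference = "ACGTACGTACGTACGT"
target = "ACGTTCGTACGTAACGT"

diffs = encode_diff(reference, target)
restored = decode_diff(reference, diffs)
print(diffs)
print(restored == target)
```

Because only the non-matching regions are stored, the size of each encoded sequence grows with the number of differences rather than with the sequence length, which is why highly similar sequences compress well under this kind of scheme. A complete compressor would also need a compact binary encoding of the op-codes and positions, which this sketch leaves out.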
