K-means clustering based compression algorithm for the high-throughput DNA sequence

This paper proposes DNAC-K, a compression algorithm for high-throughput DNA sequences based on K-means clustering. DNAC-K first groups the sequences into clusters with the K-means method, then iterates on the clusters according to the edit distances of subsequences, and finally applies Huffman coding to encode the clustering result. Experimental results on several sequencing data sets show that DNAC-K performs better than many current high-throughput DNA sequence compression algorithms.
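The abstract names two standard building blocks, edit distance and Huffman coding, without giving details. The following is a minimal Python sketch of those two generic techniques only; it is not the DNAC-K implementation. The function names `edit_distance` and `huffman_code`, and the idea of feeding a list of cluster labels to the Huffman step, are assumptions made for illustration.

```python
import heapq
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table); the unique
    # tie_breaker guarantees the dictionaries are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    # Hypothetical data: distance between two short reads, then a Huffman
    # code for an assumed list of cluster labels.
    print(edit_distance("ACGTAC", "ACTTAC"))
    labels = [0, 0, 0, 0, 1, 1, 2]
    table = huffman_code(labels)
    print(table, "".join(table[x] for x in labels))
```

In this sketch the most frequent label receives the shortest codeword, which is the property that makes Huffman coding useful when a few clusters dominate the data.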
