GenomeCompress: A Novel Algorithm for DNA Compression

The genome of an organism contains all of its hereditary information, encoded in DNA. Sequencing the genome, which determines how an organism survives, develops, and multiplies, is therefore extremely important. Over the past three decades, massive DNA sequencing efforts have made the complete genome sequences of a large number of organisms, including humans, available, and genomic databases are growing exponentially with time. Because genomes are very large, an efficient algorithm is required to compress them. General-purpose text compression algorithms do not utilize the specific characteristics of a DNA sequence, whereas DNA-specific compression algorithms exploit the repetitiveness of bases in DNA sequences. A repetitive DNA sequence is best compressed using a dictionary-based compression algorithm. Non-repetitive parts of the DNA are generally compressed using dynamic programming: the sequence is divided into square matrices, each containing repeats of a single base, and each matrix is then replaced by its base while the order of the matrix is recorded in a string. In this paper, a novel algorithm for DNA compression is proposed that compresses both repetitive and non-repetitive DNA sequences. The algorithm is compared with existing ones and is found to achieve a better compression ratio.
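To illustrate the square-matrix substitution mentioned above, the following is a minimal sketch only, not the proposed GenomeCompress algorithm. It assumes that a run of k*k identical bases is treated as a k-by-k matrix and replaced by a (base, order) token, and that all other bases are kept as literals. The function names `encode_runs` and `decode_runs` and the parameter `min_order` are hypothetical and introduced here purely for illustration.

```python
from math import isqrt


def encode_runs(seq, min_order=2):
    """Replace single-base runs with (base, order) tokens (illustrative assumption).

    A token (b, k) stands for k*k consecutive copies of base b, i.e. a
    k-by-k matrix filled with b. Bases that do not fit such a matrix
    are emitted as one-character literals.
    """
    tokens = []
    i = 0
    while i < len(seq):
        # Find the end of the run of identical bases starting at i.
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        run = j - i
        # Greedily peel off the largest square that fits in the run.
        while run > 0:
            k = isqrt(run)
            if k >= min_order:
                tokens.append((seq[i], k))
                run -= k * k
            else:
                tokens.extend(seq[i] * run)  # leftover bases stay literal
                run = 0
        i = j
    return tokens


def decode_runs(tokens):
    """Invert encode_runs by expanding each (base, order) token."""
    return "".join(
        t[0] * (t[1] ** 2) if isinstance(t, tuple) else t for t in tokens
    )
```

For example, under these assumptions a run of ten `A` bases becomes the token `("A", 3)` followed by one literal `A`, and decoding the token list reproduces the input exactly.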