Viral Genome Compression

Viruses compress their genomes to save space, and one of their main techniques is overlapping genes. We model this process as the shortest common superstring problem: we look for the shortest genome that still contains all of the genes. We give an algorithm for computing optimal solutions that is slow in the number of strings but fast (linear) in their total length, and we apply it to a number of viruses with relatively few genes. When the number of genes is larger, we compute approximate solutions with the greedy algorithm, which gives an upper bound on the length of the optimal solution. We also give a lower bound for the shortest common superstring problem. The results are then compared with what happens in nature. Remarkably, the compression achieved by viruses is quite high and very close to that achieved by modern computers.
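
As an illustration of the greedy approach mentioned above, here is a minimal Python sketch of the textbook greedy heuristic for the shortest common superstring: repeatedly merge the two strings with the largest suffix-prefix overlap until one string remains. The function names `overlap` and `greedy_superstring` and the toy gene strings are hypothetical and chosen for illustration only; this is a naive version, not the implementation used in the work described here.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(genes):
    """Greedy heuristic: merge the pair with maximum overlap until one string is left.

    The result contains every input string, so its length is an upper bound
    on the length of the shortest common superstring.
    """
    # Remove duplicates, then strings already contained in another string.
    strings = list(dict.fromkeys(genes))
    strings = [s for s in strings
               if not any(s != t and s in t for t in strings)]

    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Naive all-pairs scan for the largest overlap (assumption: small inputs).
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair, writing the overlapping part only once.
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)]
        strings.append(merged)

    return strings[0] if strings else ""


if __name__ == "__main__":
    toy_genes = ["ATGCC", "GCCTA", "CTAAT"]  # made-up strings, not real genes
    print(greedy_superstring(toy_genes))
```

On the toy input, the three strings of total length 15 are merged into a single string of length 9. The all-pairs scan is repeated in every round, so this sketch is only suitable for small inputs.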
