Searching for smallest grammars on large sequences and application to DNA

Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two distinct optimization problems: choosing which words will be constituents of the final grammar, and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we reduce the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return grammars up to 10% smaller, in approximately the same amount of time as their competitors, on a classical set of genomic sequences and on whole genomes of model organisms.
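
To make the second sub-problem concrete, the following is a minimal sketch of a minimal parsing of a single string once a set of constituents has been fixed. It is an illustration under simplifying assumptions, not the algorithm of the paper: the function name `minimal_parse` is hypothetical, only one string is parsed (in a full grammar the constituents themselves would also be parsed), and a plain dynamic program over positions is used, with no attention paid to efficiency on large sequences.

```python
def minimal_parse(text, constituents):
    """Return a shortest list of tokens whose concatenation equals `text`.

    Each token is either a single character of `text` or one of the
    given `constituents` (hypothetical helper, for illustration only).
    """
    n = len(text)
    # best[i] = minimal number of tokens needed to cover text[:i]
    best = [0] + [n + 1] * n
    # back[j] = start position of the last token in a best parse of text[:j]
    back = [0] * (n + 1)
    for i in range(n):
        # A token starting at i is the single character text[i]
        # or any constituent that occurs at position i.
        options = [text[i]] + [w for w in constituents if text.startswith(w, i)]
        for token in options:
            j = i + len(token)
            if best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = i
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    j = n
    while j > 0:
        i = back[j]
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


if __name__ == "__main__":
    print(minimal_parse("abcabcab", {"abc", "ab"}))
```

With the constituents `abc` and `ab`, the string `abcabcab` is covered by three tokens instead of eight single characters. Choosing which constituents to offer to such a parser is the first sub-problem, for which the abstract proposes restricting candidates to maximal repeats.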
