Paper information - Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem

Motivated by the goal of discovering hierarchical structures inside DNA sequences, we address the Smallest Grammar Problem: finding a smallest context-free grammar that generates exactly one sequence. This NP-hard problem has been widely studied for applications such as Data Compression, Structure Discovery and Algorithmic Information Theory. From the theoretical point of view, our contribution is a new formalisation of the Smallest Grammar Problem based on two complementary optimisation problems: the choice of the constituents of the final grammar, and the choice of how to parse the sequence with these constituents. We give a polynomial-time solution for the latter problem, which we name the "Minimal Grammar Parsing" problem. This decomposition allows us to define a new complete and correct search space for the Smallest Grammar Problem. Based on this search space, we propose new algorithms that return grammars 10% smaller than the state of the art on complete genomes. Regarding efficiency, we study different equivalence classes of repeats and introduce an efficient in-place scheme to update the suffix array data structure used to compute these words. We conclude this thesis by analysing the applications. For Structure Discovery, we consider the impact of the non-uniqueness of smallest grammars. We prove that the number of smallest grammars can be exponential in the size of the sequence, and then analyse the stability of the discovered structures between minimal grammars on real-life examples. With respect to Data Compression, we extend our algorithms to use rigid patterns as words and achieve compression rates up to 25% better than those of the previous best DNA grammar-based coder.
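
To make the second sub-problem concrete, the following is a minimal sketch of one way to parse a sequence once the constituents are fixed. It is an illustration under stated assumptions, not the algorithm of the thesis: the function names, the `<start>` key and the size measure (total length of all right-hand sides) are hypothetical choices made here. Each right-hand side is obtained by a shortest-path style dynamic programme over positions of the string, which runs in polynomial time.

```python
def shortest_parse(target, words):
    """Write `target` with as few pieces as possible, where a piece is
    either a single symbol or one of `words` (never `target` itself)."""
    n = len(target)
    usable = [w for w in words if w != target and len(w) >= 2]
    # best[i]: fewest pieces covering target[:i]
    # back[i]: (start position, piece) of the last piece in that cover
    best = [0] + [n + 1] * n
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        # Default: end the cover with the single symbol target[i-1].
        best[i] = best[i - 1] + 1
        back[i] = (i - 1, target[i - 1])
        # Try ending the cover with an occurrence of a constituent.
        for w in usable:
            j = i - len(w)
            if j >= 0 and target.startswith(w, j) and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = (j, w)
    # Walk the back-pointers to recover the pieces in order.
    pieces = []
    i = n
    while i > 0:
        j, piece = back[i]
        pieces.append(piece)
        i = j
    return pieces[::-1]


def minimal_grammar_parsing(sequence, constituents):
    """One rule for the whole sequence plus one rule per constituent.
    Rules are keyed by the word they generate (an assumption of this
    sketch; "<start>" is a hypothetical name for the axiom)."""
    grammar = {"<start>": shortest_parse(sequence, constituents)}
    for w in constituents:
        grammar[w] = shortest_parse(w, constituents)
    return grammar


def grammar_size(grammar):
    """Assumed size measure: total number of right-hand-side symbols."""
    return sum(len(rhs) for rhs in grammar.values())
```

For example, `minimal_grammar_parsing("abcabcab", ["abc", "ab"])` parses the sequence as `abc abc ab`, rewrites `abc` as `ab c`, and leaves `ab` as `a b`. The other sub-problem, choosing which constituents to include, is where the search effort goes and is not covered by this sketch.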

Matthias Gallé
