Identifying Hierarchical Structure in Sequences: A linear-time algorithm

SEQUITUR is an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing each repeated phrase with a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence, which offers insights into its lexical structure. The algorithm is driven by two constraints that reduce the size of the grammar and produce structure as a by-product. SEQUITUR breaks new ground by operating incrementally. Moreover, the method's simple structure permits a proof that it operates in space and time that are linear in the size of the input. Our implementation can process 50,000 symbols per second and has been applied to an extensive range of real-world sequences.
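The idea of replacing repeated phrases with rules, recursively, can be illustrated with a minimal sketch. This is not the SEQUITUR algorithm itself: it is a simplified offline variant that rescans the whole sequence on every pass, so it is neither incremental nor linear-time. It only mimics the two constraints, which the paper calls digram uniqueness (no pair of adjacent symbols appears more than once) and rule utility (every rule is used more than once). The function names and the convention that integers denote rules are assumptions made for this illustration, and input symbols are assumed not to be integers.

```python
from collections import Counter

def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue  # overlaps the previous counted occurrence (e.g. "aaa")
        counts[pair] += 1
        last_pos[pair] = i
    return counts

def replace_digram(seq, pair, symbol):
    """Replace each non-overlapping occurrence of `pair` with `symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def splice(seq, rule_id, body):
    """Substitute a rule's body for every use of the rule in `seq`."""
    out = []
    for s in seq:
        if s == rule_id:
            out.extend(body)
        else:
            out.append(s)
    return out

def infer_grammar(sequence):
    """Return (start, rules); integers in either are references to rules."""
    start, rules, next_id = list(sequence), {}, 1
    # Digram-uniqueness analogue: keep replacing while some digram repeats.
    while True:
        counts = count_digrams(start)
        if not counts:
            break
        pair, n = max(counts.items(), key=lambda kv: kv[1])
        if n < 2:
            break
        rules[next_id] = list(pair)
        start = replace_digram(start, pair, next_id)
        next_id += 1
    # Rule-utility analogue: inline any rule that is referenced only once.
    changed = True
    while changed:
        changed = False
        uses = Counter(s for body in [start, *rules.values()]
                       for s in body if isinstance(s, int))
        for rid in list(rules):
            if uses[rid] == 1:
                body = rules.pop(rid)
                start = splice(start, rid, body)
                rules = {k: splice(v, rid, body) for k, v in rules.items()}
                changed = True
                break
    return start, rules

def expand(seq, rules):
    """Regenerate the original symbols from the grammar."""
    out = []
    for s in seq:
        if isinstance(s, int):
            out.extend(expand(rules[s], rules))
        else:
            out.append(s)
    return out
```

Under these assumptions, `infer_grammar("abcabc")` yields the start sequence `[2, 2]` with the single rule `{2: ['a', 'b', 'c']}`: the intermediate rule for `ab` is inlined because it ends up being used only once, and `expand` recovers the original string.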
