Words and special factors

In this paper we consider sets of factors of a given finite word over a finite alphabet that allow us to reconstruct the entire word. The analysis is based on the notion of special factor. A factor u of a finite word w is called right (resp. left) special if there exist two distinct letters x and y such that ux and uy (resp. xu and yu) are factors of w. A factor is bispecial if it is both right and left special. A proper box of w is any factor of w of the form asb, where a and b are letters and s is a bispecial factor of w. The initial (resp. terminal) box of w is the shortest prefix (resp. suffix) of w that is an unrepeated factor. A box is called maximal if it is not a proper factor of another box. The main result of the paper is the following theorem (maximal box theorem): any finite word w is uniquely determined by its initial box, its terminal box, and its set of maximal boxes. A consequence is that a finite word w is uniquely determined by its factors up to length n = max{R_w, K_w} + 1, where K_w is the length of the terminal box and R_w is the minimal natural number for which there is no right special factor of length R_w. Some structural properties of boxes are studied. Another important combinatorial notion is that of superbox. A superbox is any factor of w of the form asb, where a and b are letters and s is a repeated factor, whereas as and sb are unrepeated factors. A theorem for superboxes similar to the maximal box theorem is proved. Some algorithms are given that construct boxes and superboxes and, conversely, reconstruct the word from them. In this combinatorial framework we give an upper and a lower bound on the number of states of a minimal deterministic automaton recognizing the set of factors of w. These bounds are sharper than the previously known bounds.
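The definitions above can be made concrete with a small brute-force sketch. It is not the algorithm given in the paper; it simply enumerates all factors of a short nonempty word and applies each definition literally. All function names are hypothetical, the word is assumed to be a nonempty Python string, and "maximal" is taken over the proper boxes together with the initial and terminal boxes.

```python
def factors(w):
    """All factors (contiguous substrings) of w, including the empty word."""
    n = len(w)
    return {w[i:j] for i in range(n + 1) for j in range(i, n + 1)}

def occurrences(w, u):
    """Number of (possibly overlapping) occurrences of u in w."""
    return sum(w.startswith(u, i) for i in range(len(w) - len(u) + 1))

def right_special(w):
    """Factors u such that ux and uy are factors for two distinct letters x, y."""
    F, letters = factors(w), set(w)
    return {u for u in F if sum(u + x in F for x in letters) >= 2}

def left_special(w):
    """Factors u such that xu and yu are factors for two distinct letters x, y."""
    F, letters = factors(w), set(w)
    return {u for u in F if sum(x + u in F for x in letters) >= 2}

def bispecial(w):
    return right_special(w) & left_special(w)

def proper_boxes(w):
    """Factors of the form a s b with a, b letters and s bispecial."""
    F, letters = factors(w), set(w)
    return {a + s + b for s in bispecial(w)
            for a in letters for b in letters if a + s + b in F}

def initial_box(w):
    """Shortest prefix of w occurring exactly once (w assumed nonempty)."""
    return next(w[:n] for n in range(1, len(w) + 1)
                if occurrences(w, w[:n]) == 1)

def terminal_box(w):
    """Shortest suffix of w occurring exactly once (w assumed nonempty)."""
    return next(w[-n:] for n in range(1, len(w) + 1)
                if occurrences(w, w[-n:]) == 1)

def maximal_boxes(w):
    """Boxes that are not a proper factor of another box."""
    boxes = proper_boxes(w) | {initial_box(w), terminal_box(w)}
    return {b for b in boxes if not any(b != c and b in c for c in boxes)}

def superboxes(w):
    """Factors a s b with s repeated while a s and s b are unrepeated."""
    return {f for f in factors(w) if len(f) >= 2
            and occurrences(w, f[1:-1]) >= 2
            and occurrences(w, f[:-1]) == 1
            and occurrences(w, f[1:]) == 1}

def reconstruction_length(w):
    """n = max{R_w, K_w} + 1, with R_w and K_w as defined in the abstract."""
    special_lengths = {len(u) for u in right_special(w)}
    r = 0
    while r in special_lengths:  # R_w: least length with no right special factor
        r += 1
    return max(r, len(terminal_box(w))) + 1

if __name__ == "__main__":
    w = "abaab"
    print(sorted(bispecial(w)))
    print(initial_box(w), terminal_box(w))
    print(sorted(maximal_boxes(w)))
    print(sorted(superboxes(w)))
    print(reconstruction_length(w))
```

For w = abaab, the bispecial factors are the empty word and a, the initial box is aba, the terminal box is aab, the maximal boxes are aab, aba and baa, the only superbox is baa, and n = max{2, 3} + 1 = 4. The enumeration takes time polynomial in the length of w, so it is suitable only for checking small examples by hand.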
