Asymptotic Analysis of the kth Subword Complexity

Patterns within strings enable us to extract vital information regarding a string’s randomness. Whether a string is random (showing little to no repetition in patterns) or periodic (showing repetitions in patterns) is described by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memoryless sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This yields exact expressions for the two moments in terms of the patterns’ autocorrelation and correlation polynomials. We then investigate the asymptotic behavior for k = Θ(log n). In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology involves complex analysis, analytic poissonization and depoissonization, the Mellin transform, and saddle point analysis.
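
To make the definition concrete, the following is a minimal Python sketch and not the paper's method. It counts the distinct length-k substrings of a string and, for very small n, computes the exact expected value by enumerating all binary strings from a memoryless source. The function names `subword_complexity` and `expected_complexity`, and the parameter `p` (the probability of the symbol 1), are illustrative assumptions. Returning 0 when k is outside 1..n is a convention chosen here.

```python
from itertools import product


def subword_complexity(s: str, k: int) -> int:
    """Number of distinct substrings of length k that appear in s."""
    if k < 1 or k > len(s):
        return 0  # convention assumed for this sketch
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def expected_complexity(n: int, k: int, p: float = 0.5) -> float:
    """Exact expected kth subword complexity of a length-n binary string
    whose symbols are i.i.d. with P(symbol = '1') = p.

    Brute force over all 2**n strings, so only usable for small n.
    """
    total = 0.0
    for bits in product("01", repeat=n):
        s = "".join(bits)
        ones = s.count("1")
        prob = p ** ones * (1 - p) ** (n - ones)
        total += prob * subword_complexity(s, k)
    return total


if __name__ == "__main__":
    print(subword_complexity("0000", 2))  # periodic string: only "00" occurs
    print(subword_complexity("0110", 2))  # "01", "11", "10"
    print(expected_complexity(8, 3))
```

The enumeration costs on the order of 2^n operations, which is why exact expressions and asymptotics in the regime k = Θ(log n) are needed for strings of realistic length.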
