A Method to Calculate Probability and Expected Document Frequency of Discontinued Word Sequences

In this paper, we present a novel technique for calculating the probability of occurrence of a discontinued sequence of n words, that is, the probability that those words occur, and that they occur in a given order, regardless of which and how many other words may occur between them. Our method relies on the formalization of word occurrences as a Markov chain model. Techniques from probability theory and linear algebra are exploited to obtain an algorithm of competitive computational complexity. The technique is further extended to permit the efficient calculation of the expected document frequency of an n-word sequence. Finally, we present an application of this technique: a fast, automatic, and direct evaluation of the interestingness of word sequences, obtained by comparing their expected and observed frequencies.
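
To make the idea concrete, the following is a minimal sketch and not the optimized algorithm of the paper. It assumes a unigram model in which each position of a document is drawn independently and word i of the sequence has probability p_i; this independence assumption, and the names `sequence_probability` and `expected_document_frequency`, are illustrative choices not stated in the abstract. State i of the chain means "the first i words of the sequence have been seen in order", and state n is absorbing.

```python
def sequence_probability(word_probs, doc_length):
    """Probability that the words occur in the given order, gaps allowed,
    in a document of doc_length positions (assumed i.i.d. unigram draws)."""
    n = len(word_probs)
    # state[i]: probability that exactly the first i words have been matched.
    state = [1.0] + [0.0] * n
    for _ in range(doc_length):
        nxt = [0.0] * (n + 1)
        for i in range(n):
            p = word_probs[i]                # chance the next token is word i+1
            nxt[i] += state[i] * (1.0 - p)   # stay: token is some other word
            nxt[i + 1] += state[i] * p       # advance: next word matched
        nxt[n] += state[n]                   # absorbing state: sequence complete
        state = nxt
    return state[n]


def expected_document_frequency(word_probs, doc_lengths):
    """Expected number of documents containing the sequence, obtained by
    summing the per-document probability over the given document lengths."""
    return sum(sequence_probability(word_probs, length) for length in doc_lengths)


if __name__ == "__main__":
    # Hypothetical unigram probabilities and document lengths, for illustration only.
    probs = [0.01, 0.002, 0.005]
    lengths = [120, 300, 85, 1000]
    print(expected_document_frequency(probs, lengths))
```

This direct simulation costs O(n · l) per document of length l. Comparing the returned expectation with the frequency actually observed in a collection is the kind of comparison the interestingness evaluation relies on.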