A statistical analysis of interdependence in character sequences

Abstract. This paper presents and demonstrates a methodology for analyzing interdependence in character sequences. Each character is assumed to be dependent on characters separated from it by an arbitrary number, (m − 1), of spaces. For each postulated dependence, i.e., for each value of m, a conditional entropy reflecting its statistical significance is calculated for the sequence. Then the conditional self-information of each character with respect to the postulated dependence is determined. A statistical test is applied to select, from the postulated dependencies, those that are significant. The sum of the self-information values, weighted according to the corresponding conditional entropies, is used as a measure of the syntactic significance of each character. When this methodology is applied to an English text, it is found that dependence on the immediate neighbor is dominant; the dependence decreases monotonically with increasing separation and becomes statistically insignificant for separations greater than about 5 spaces. It is also observed that a character with high syntactic significance, as defined in the paper, is more informative for the recognition of the word of which it is a part.
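To make the first two steps of the methodology concrete, the following is a minimal Python sketch, not the authors' implementation: it estimates the conditional entropy H(X_n | X_{n−m}) from pair frequencies at separation m, along with the per-character conditional self-information −log2 p(x_n | x_{n−m}). The plug-in (maximum-likelihood) estimates, the function names, and the toy corpus are assumptions introduced here for illustration.

```python
import math
from collections import Counter

def conditional_entropy(text: str, m: int) -> float:
    """Estimate H(X_n | X_{n-m}) in bits from pair frequencies at separation m."""
    pairs = Counter(zip(text, text[m:]))   # counts of (x_{n-m}, x_n) pairs
    contexts = Counter(text[:-m])          # counts of the conditioning character
    total = sum(pairs.values())
    h = 0.0
    for (prev, _), n_pair in pairs.items():
        p_pair = n_pair / total            # joint p(x_{n-m}, x_n)
        p_cond = n_pair / contexts[prev]   # conditional p(x_n | x_{n-m})
        h -= p_pair * math.log2(p_cond)
    return h

def self_information(text: str, m: int) -> list[float]:
    """Per-character conditional self-information -log2 p(x_n | x_{n-m})."""
    pairs = Counter(zip(text, text[m:]))
    contexts = Counter(text[:-m])
    return [-math.log2(pairs[(text[i - m], text[i])] / contexts[text[i - m]])
            for i in range(m, len(text))]

# Toy corpus for demonstration only; a real analysis would use a large text.
sample = ("the analysis of character sequences requires the counting of "
          "letter pairs at various separations so that the conditional "
          "entropies for each separation can be estimated from the text")

for m in range(1, 8):
    print(f"m = {m}: H(X_n | X_(n-{m})) = {conditional_entropy(sample, m):.3f} bits")
```

Note that the average of the per-character self-information values at separation m equals the plug-in conditional entropy estimate for that m, which is consistent with the entropy being the expected self-information.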