Generation of a Model for Grapheme Frequencies and its Refinement and Validation by Group Theoretic Aspects

Abstract: The frequencies of graphemes in a text generally follow Zipf's law. In an attempt to develop a theoretical model for grapheme frequencies, Grzybek and Kelih tested different distribution models and concluded that the rank-frequency distribution for Slavic languages can be expressed by the negative hypergeometric distribution. Applying this distribution to different corpora led us to derive a functional relationship between ranks and letters of the English alphabet, which forms the basis of the present study. To identify the patterns of letters in the corpus, we applied group-theoretic concepts and observed that different rings are generated corresponding to ranks 1, 2 having values in the range 23–26, and fields for ranks in the ranges 3–9 and 10–22. Applying these rings and fields shows that the frequency distribution can always be fitted by locally adopting an equation within these sets. This has led us to a general model for the rank-frequency distribution of English texts.
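To make the distribution named in the abstract concrete, the following is a minimal sketch of a negative hypergeometric rank-frequency model. It assumes the common three-parameter form P(x) = C(M+x-1, x) C(K-M+n-x-1, n-x) / C(K+n-1, n) for x = 0..n, shifted by one so that ranks run from 1 to n+1. The function names and the values K = 3.0 and M = 0.8 are illustrative assumptions only; they are not fitted parameters and are not taken from the paper.

```python
import math


def log_binom(a, b):
    """Log of the generalized binomial coefficient C(a, b), computed with
    log-gamma so that real-valued (non-integer) arguments are allowed."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def neg_hypergeom_pmf(x, K, M, n):
    """P(X = x) for x = 0..n under the negative hypergeometric distribution
    with real parameters K > M > 0 and integer n >= 1."""
    return math.exp(
        log_binom(M + x - 1, x)
        + log_binom(K - M + n - x - 1, n - x)
        - log_binom(K + n - 1, n)
    )


def rank_probabilities(K, M, inventory_size=26):
    """Model probabilities for ranks 1..inventory_size (1-displaced form).
    n is fixed to inventory_size - 1 so the support matches the rank range."""
    n = inventory_size - 1
    return [neg_hypergeom_pmf(r - 1, K, M, n) for r in range(1, inventory_size + 1)]


# Hypothetical parameter values, chosen only to give a decreasing curve.
probs = rank_probabilities(K=3.0, M=0.8)
for rank, p in enumerate(probs[:5], start=1):
    print(f"rank {rank}: {p:.4f}")
```

In practice K and M would be estimated from observed letter counts, for example by minimizing a chi-square discrepancy between the observed and modelled rank frequencies. The sketch only evaluates the model for given parameters.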

[1] Robert L. Solso, et al. Frequency and versatility of letters in the English language, 1976.

[2] Tim Bell, et al. Source Models for Natural Language, 1988.

[3] S. Naranan, et al. Models for Power Law Relations in Linguistics and Information Science, 1998, J. Quant. Linguistics.

[4] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[5] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[6] Mark Borodovsky, et al. Comparison of Equations Describing the Ranked Frequency Distributions of Graphemes and Phonemes, 1996, J. Quant. Linguistics.

[7] Peter Grzybek, et al. Towards a General Model of Grapheme Frequencies for Slavic Languages, 2006.

[8] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[9] Ali Eftekhari, et al. Fractal geometry of texts: An initial application to the works of Shakespeare, 2006, J. Quant. Linguistics.

[10] A. M. Ramer. Mathematical Methods in Linguistics, 1992.

[11] Arnaud Rey, et al. Graphemes are perceptual reading units, 2000, Cognition.

[12] Bengt Sigurd, et al. Rank-Frequency Distributions for Phonemes, 1968.

[13] Steven Pinker, et al. Words and rules, 1998.

[14] Adam L. Berger, et al. A Maximum Entropy Approach to Natural Language Processing, 1996, CL.

[15] Marcus Kracht, et al. The mathematics of language, 2003.

[16] William Ray Arney, et al. Statistics as Language, 1979.