Fitting Ranked English and Spanish Letter Frequency Distribution in US and Mexican Presidential Speeches

Abstract: The abscissa of a ranked letter frequency distribution spans only a limited range, so several functions fit the observed distribution reasonably well. To compare these functions critically, we apply statistical model selection to ten functions, using the texts of US and Mexican presidential speeches from the last few centuries. Although certain letters occasionally switch rank order over time in both datasets, letter usage is generally stable. The best-fitting function, judged either by least-square error or by AIC/BIC model selection, is the Cocho/Beta function. We also use a novel method to discover clusters of letters from their observed-over-expected frequency ratios.
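As a rough illustration of the comparison described in the abstract, the sketch below fits two candidate rank-frequency functions by least squares in log space and scores them with AIC and BIC. It makes several assumptions that are not taken from the paper: the Cocho/Beta function is written in the form commonly used in the literature, f(r) = C (N + 1 - r)^b / r^a; the competing model is a plain power law, f(r) = C / r^a; the information criteria use the Gaussian least-squares forms AIC = n ln(SSE/n) + 2k and BIC = n ln(SSE/n) + k ln n; and the frequencies are synthetic, generated with hypothetical parameter values rather than taken from the speech corpora. All function and variable names are illustrative.

```python
import math

def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_log_linear(columns, log_f):
    """Least-squares fit of log_f on predictor columns; returns (coefficients, SSE)."""
    n, k = len(log_f), len(columns)
    X = [[col[i] for col in columns] for i in range(n)]
    XtX = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * log_f[i] for i in range(n)) for p in range(k)]
    coef = solve_linear(XtX, Xty)
    sse = sum((log_f[i] - sum(coef[p] * X[i][p] for p in range(k))) ** 2
              for i in range(n))
    return coef, sse

def aic_bic(sse, n, k):
    """Gaussian least-squares forms of AIC and BIC for k fitted parameters."""
    base = n * math.log(sse / n)
    return base + 2 * k, base + k * math.log(n)

# Synthetic ranked frequencies (NOT real data): a Beta-shaped curve with
# hypothetical exponents and a small deterministic perturbation.
N = 26
TRUE_A, TRUE_B = 0.3, 1.0
ranks = list(range(1, N + 1))
raw = [(N + 1 - r) ** TRUE_B / r ** TRUE_A * (1 + 0.02 * math.sin(3 * r))
       for r in ranks]
total = sum(raw)
log_f = [math.log(v / total) for v in raw]

ones = [1.0] * N
neg_log_r = [-math.log(r) for r in ranks]
log_tail = [math.log(N + 1 - r) for r in ranks]

# Beta function: log f = log C - a log r + b log(N + 1 - r)  (3 parameters)
beta_coef, beta_sse = fit_log_linear([ones, neg_log_r, log_tail], log_f)
# Power law:     log f = log C - a log r                      (2 parameters)
power_coef, power_sse = fit_log_linear([ones, neg_log_r], log_f)

aic_beta, bic_beta = aic_bic(beta_sse, N, 3)
aic_power, bic_power = aic_bic(power_sse, N, 2)

print(f"Beta:  a={beta_coef[1]:.3f} b={beta_coef[2]:.3f} "
      f"AIC={aic_beta:.1f} BIC={bic_beta:.1f}")
print(f"Power: a={power_coef[1]:.3f} AIC={aic_power:.1f} BIC={bic_power:.1f}")
```

Lower AIC or BIC indicates the preferred model. Fitting in log space keeps both models linear in their parameters, which is a simplification chosen for this sketch; a nonlinear least-squares fit on the original frequency scale is an alternative and may rank models differently.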
