Fitting Ranked Linguistic Data with Two-Parameter Functions

It is well known that many ranked linguistic data sets are fitted well by one-parameter models, such as Zipf’s law for ranked word frequencies. However, where the data deviate from the one-parameter model (these deviations occur at the two extremes of the rank range), it is natural to add a second parameter to the fitting model. In this paper, we compare several two-parameter models, including the Beta, Yule, and Weibull functions, all of which can be framed as a multiple regression on the logarithmic scale, in how well they fit several ranked linguistic data sets: letter frequencies, word-spacings, and word frequencies. We observe that the Beta function fits the ranked letter frequencies best, the Yule function fits the ranked word-spacing distribution best, and the Altmann, Beta, and Yule functions all slightly outperform Zipf’s power-law function for the ranked word-frequency distribution.
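To illustrate what "multiple regression on the logarithmic scale" can look like in practice, here is a minimal sketch. The abstract does not state the functional forms, so the ones used here are assumptions taken from their common statements in the rank-frequency literature: Zipf as f(r) = C / r^a, Beta as f(r) = C (N + 1 - r)^b / r^a, and Yule as f(r) = C b^r / r^a, where r is the rank and N the number of ranked items. The function name `fit_log_linear` and its return values are hypothetical and chosen only for this example.

```python
import numpy as np

def fit_log_linear(freq, model="beta"):
    """Least-squares fit of a ranked-frequency model on the log scale.

    Assumed forms (not given in the abstract):
      zipf: log f = log C - a*log r
      beta: log f = log C - a*log r + b*log(N + 1 - r)
      yule: log f = log C - a*log r + r*log b
    Returns (coefficients, sum of squared residuals in log f).
    """
    # Rank the data: largest frequency gets rank 1.
    f = np.sort(np.asarray(freq, dtype=float))[::-1]
    if np.any(f <= 0):
        raise ValueError("frequencies must be positive to take logarithms")
    n = f.size
    r = np.arange(1, n + 1, dtype=float)

    # Each model is linear in its parameters once logs are taken.
    if model == "zipf":
        cols = [np.log(r)]
    elif model == "beta":
        cols = [np.log(r), np.log(n + 1 - r)]
    elif model == "yule":
        cols = [np.log(r), r]
    else:
        raise ValueError(f"unknown model: {model}")

    X = np.column_stack([np.ones(n)] + cols)  # intercept column is log C
    y = np.log(f)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)

# Synthetic example: 26 "letters" generated from the assumed Beta form.
ranks = np.arange(1, 27)
synthetic = 0.1 * (27 - ranks) ** 0.5 / ranks ** 0.3
beta_coef, beta_sse = fit_log_linear(synthetic, "beta")
zipf_coef, zipf_sse = fit_log_linear(synthetic, "zipf")
```

In this sketch the fitted coefficient on log r is -a and the coefficient on the second regressor is b (Beta) or log b (Yule). Because the Zipf form is nested in both two-parameter forms, the two-parameter residual sum of squares can never exceed the Zipf one on the same data, so a fair comparison would also need a criterion that penalizes the extra parameter.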
