A Simple LNRE Model for Random Character Sequences

This paper describes a population model for word frequency distributions based on the Zipf-Mandelbrot law, corresponding to the word frequency distribution induced by a random character sequence. The model, which has convenient analytical and numerical properties, is shown to be adequate for the description of language data extracted by automatic means from large text corpora. It can thus be used to study the problems faced by the statistical analysis of such data in the field of natural-language processing.
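The notion of a word frequency distribution induced by a random character sequence can be illustrated with a small simulation. The sketch below is not the paper's model or estimation procedure; it only generates a random text and tabulates the resulting frequencies. The three-letter alphabet, the uniform symbol probabilities, the sample size, and the function names are all assumptions made for illustration.

```python
import random
from collections import Counter

def random_text_frequencies(n_chars, letters="abc", seed=0):
    """Draw n_chars symbols uniformly from letters plus a space,
    split the result on whitespace, and count each word type.
    (Uniform probabilities are an illustrative assumption.)"""
    rng = random.Random(seed)
    symbols = letters + " "
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    # str.split() with no argument drops the empty strings that
    # runs of consecutive spaces would otherwise produce.
    return Counter(text.split())

def frequency_spectrum(freqs):
    """Map each frequency m to the number of types occurring m times."""
    return Counter(freqs.values())

freqs = random_text_frequencies(200_000)
ranked = freqs.most_common()          # types sorted by decreasing frequency
spectrum = frequency_spectrum(freqs)
N = sum(freqs.values())               # sample size (tokens)
V = len(freqs)                        # vocabulary size (types)

for rank, (word, f) in enumerate(ranked[:10], start=1):
    print(rank, word, f)
print("N =", N, "V =", V, "hapax legomena =", spectrum[1])
```

With uniform letter probabilities, all words of the same length are equally likely, so the rank-frequency curve falls in steps rather than along a smooth line; on a log-log plot those steps should roughly follow the power-law shape associated with the Zipf-Mandelbrot law.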
