Extension of Zipf’s Law to Words and Phrases

Zipf's law states that the frequency of a word token in a large corpus of natural language is inversely proportional to its rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words. However, when single words and n-gram phrases are combined in one list and ordered by frequency, the combined list follows Zipf's law accurately for all words and phrases, down to the lowest frequencies, in both languages. The Zipf curves for the two languages are then almost identical.
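The combined ranking described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names (`ngram_counts`, `rank_frequency`, `loglog_slope`), the cutoff `max_n=3`, and the toy sentence are all assumptions made for illustration, and the input is assumed to be already tokenized (for Mandarin, this presupposes a separate word-segmentation step). The sketch counts single words and n-gram phrases in one table, orders them by frequency, and estimates the slope of log frequency against log rank; a slope near -1 would correspond to Zipf's law.

```python
from collections import Counter
import math


def ngram_counts(tokens, max_n=3):
    """Count every n-gram (as a tuple of tokens) for n = 1..max_n in one table."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent item."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input (hypothetical); a real test needs a corpus of millions of tokens.
tokens = "the cat sat on the mat and the cat ran".split()
combined = ngram_counts(tokens, max_n=3)   # words and phrases in one list
pairs = rank_frequency(combined)
slope = loglog_slope(pairs)
```

On a corpus this small the slope is not meaningful; the sketch only shows how words and phrases end up in a single frequency-ordered list. A straight-line fit over all ranks is also a simplification, since the abstract's point is precisely that the single-word curve departs from the law at low frequencies.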
