On the law of Zipf-Mandelbrot for multi-word phrases

This article studies the probabilities of occurrence of multi-word (m-word) phrases (m = 2, 3, ...) in relation to the probabilities of occurrence of single words. It is well known that single words follow Zipf's law, i.e., a power law. We prove that m-word phrases (m ≥ 2) do not, and we present two independent proofs of this. We furthermore show that, if the distribution found for m-word phrases is nevertheless approximated by Zipf's law, the resulting power-law exponents β_m form a strictly decreasing sequence (β_m), m ∈ ℕ. This explains experimental findings of Smith and Devine (1985), Hilberg (1988), and Meyer (1989a,b). Our results should be compared with a heuristic finding of Rousseau, who states that the law of Zipf-Mandelbrot is valid for multi-word phrases. He, however, uses other (less classical) assumptions than we do.
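The claim about decreasing exponents can be illustrated numerically. The following is a minimal sketch, not the article's proof: it assumes, purely for illustration, that words occur independently (so a phrase probability is the product of its word probabilities), and it fits a power law by ordinary least squares on a log-log scale over the top-ranked phrases. The function names, the vocabulary size of 60, and the single-word exponent 1.0 are all hypothetical choices, not taken from the article.

```python
import itertools
import math


def zipf_probabilities(n_words, beta):
    """Single-word probabilities p_r = C / r**beta for ranks r = 1..n_words."""
    weights = [r ** -beta for r in range(1, n_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def phrase_probabilities(word_probs, m):
    """Probabilities of all m-word phrases, sorted in decreasing order.

    Illustrative assumption: words occur independently, so a phrase
    probability is the product of its word probabilities.
    """
    probs = [math.prod(combo)
             for combo in itertools.product(word_probs, repeat=m)]
    probs.sort(reverse=True)
    return probs


def fitted_exponent(sorted_probs, n_ranks):
    """Negated least-squares slope of log(probability) against log(rank)."""
    xs = [math.log(r) for r in range(1, n_ranks + 1)]
    ys = [math.log(p) for p in sorted_probs[:n_ranks]]
    x_mean = sum(xs) / n_ranks
    y_mean = sum(ys) / n_ranks
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return -sxy / sxx


def fitted_exponents(n_words=60, beta=1.0, max_m=3):
    """Fitted exponents for m = 1..max_m, using the top n_words ranks."""
    word_probs = zipf_probabilities(n_words, beta)
    return [
        fitted_exponent(phrase_probabilities(word_probs, m), n_words)
        for m in range(1, max_m + 1)
    ]


if __name__ == "__main__":
    for m, b in enumerate(fitted_exponents(), start=1):
        print(f"m = {m}: fitted exponent = {b:.3f}")
```

For m = 1 the fit recovers the single-word exponent, since the input is an exact power law. For m = 2 and m = 3 the fitted values come out smaller at each step, in line with the strictly decreasing sequence described above. The fit is restricted to the top-ranked phrases so that the finite vocabulary does not distort the tail.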

[1]  George Kingsley Zipf,et al.  Human behavior and the principle of least effort , 1949 .

[2]  B. Mandelbrot.  Fractal Geometry of Nature , 1984 .

[3]  D. Huffman.  A Method for the Construction of Minimum-Redundancy Codes , 1952 .

[4]  Last Words: A Test of Authorship for Greek Writers , 1972 .

[5]  Ye-Sho Chen,et al.  Statistical Models of Text in Continuous Speech Recognition , 1991 .

[6]  H. S. Heaps,et al.  Selection of equifrequent word fragments for information retrieval , 1973, Inf. Storage Retr..

[7]  Claude S. Brinegar,et al.  Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship , 1963 .

[8]  William McColly,et al.  Literary attribution and likelihood-ratio tests: The case of the middle English Pearl-poems , 1983, Comput. Humanit..

[9]  Leo Egghe,et al.  Introduction to Informetrics: Quantitative Methods in Library, Documentation and Information Science , 1990 .

[10]  H. S. Heaps,et al.  Information retrieval, computational and theoretical aspects , 1978 .

[11]  Benoit B. Mandelbrot,et al.  Fractals: Form, Chance and Dimension , 1978 .

[12]  Benoit B. Mandelbrot,et al.  Structure Formelle des Textes et Communication , 1954 .

[13]  Ye-Sho Chen,et al.  ZIPF'S LAWS IN TEXT MODELING , 1989 .

[14]  David R. Cox,et al.  On a Discriminatory Problem Connected with the Works of Plato , 1959 .

[15]  F. J. Smith,et al.  Storing and retrieving word phrases , 1985, Inf. Process. Manag..

[16]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[17]  Leo Egghe,et al.  Duality in information retrieval and the hypergeometric distribution , 1997, J. Documentation.