Assessment of a Modern Farsi Corpus

The development of Language Engineering (LE) and Information Retrieval (IR) applications requires the availability of sizeable, reliable and representative corpora. This paper describes how we constructed a well-structured 345 MB tagged corpus of news text, and presents useful statistics of this corpus based on the characteristics of the Farsi language. It examines in particular detail how well the rank-frequency distribution of Farsi words fits the Zipf-Mandelbrot law. Finally, we present our measurement of the entropy of Farsi for this corpus.
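For readers unfamiliar with the two measurements named above, the following is a minimal sketch of how a rank-frequency list, a Zipf-Mandelbrot prediction, and a word-level entropy might be computed. It is not the procedure used in the paper. The function names, the toy token list, and the parameter names `C`, `b` and `a` are assumptions made for illustration, and the entropy shown is the plain unigram Shannon entropy, which may differ from the estimate the paper reports.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def zipf_mandelbrot(rank, C, b, a):
    """Zipf-Mandelbrot prediction f(r) = C / (r + b)**a.

    C, b and a are free parameters that would be fitted to a corpus;
    with b = 0 the formula reduces to the classic Zipf law.
    """
    return C / (rank + b) ** a


def unigram_entropy(tokens):
    """Shannon entropy, in bits per word, of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data standing in for a tokenized corpus.
tokens = "a a a a b b c d".split()

freqs = rank_frequency(tokens)
entropy_bits = unigram_entropy(tokens)

# Compare observed counts with a Zipf-Mandelbrot curve for assumed parameters.
predicted = [zipf_mandelbrot(r, C=4.0, b=0.0, a=1.0) for r in range(1, len(freqs) + 1)]
```

On a real corpus, the parameters would be estimated by fitting the predicted curve to the observed counts, typically on a log-log scale, rather than chosen by hand as here.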
