Measuring Readability of Polish Texts: Baseline Experiments

Measuring the readability of a text is the first sensible step towards simplifying it. In this paper we present an overview of the most common approaches to automatic readability measurement. Of these, we implemented and evaluated two: the Gunning FOG index and the Flesch-based Pisarek method. We also present two other approaches. The first measures the distributional lexical similarity between a target text and a set of reference texts. The second is a novel method for automating the Taylor test, which in its base form requires conducting a large number of surveys. The automation relies on statistical language modelling. We have developed a free web-based system and built plugins for the most common text editors, namely Microsoft Word and OpenOffice.org. The inner workings of the system are described in detail. Finally, extensive evaluations are performed for Polish, a highly inflected Slavic language. We show that Pisarek’s method correlates strongly with the Gunning FOG index, even though the two differ in form, and that both the similarity-based approach and the automated Taylor test achieve high accuracy. The merits of each are discussed.
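As an illustration of the first of the evaluated measures, the sketch below computes the Gunning FOG index in its standard form, 0.4 × (average sentence length + percentage of complex words). It is a minimal sketch and not the implementation used in the system: the function names, the vowel-counting syllable heuristic for Polish, the regular-expression tokenisation, and the `complex_syllables` threshold (set here to the original English definition of three or more syllables) are all assumptions made for the example.

```python
import re

# Polish vowel letters; used by the rough syllable heuristic below (an assumption).
VOWELS = "aąeęioóuy"


def count_syllables(word: str) -> int:
    """Approximate the syllable count by counting vowel letters.

    An 'i' directly followed by another vowel is treated as a softening
    marker (as in 'miasto') and is not counted. This is a heuristic only.
    """
    w = word.lower()
    count = 0
    for i, ch in enumerate(w):
        if ch not in VOWELS:
            continue
        if ch == "i" and i + 1 < len(w) and w[i + 1] in VOWELS:
            continue
        count += 1
    return count


def gunning_fog(text: str, complex_syllables: int = 3) -> float:
    """Gunning FOG index: 0.4 * (words per sentence + 100 * complex / words).

    `complex_syllables` is the minimum syllable count for a word to be
    treated as complex; the right value for Polish is an assumption to tune.
    """
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are maximal runs of Unicode letters.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = sum(
        1 for w in words if count_syllables(w) >= complex_syllables
    )
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))
```

Calling `gunning_fog(text)` on a document returns a score that grows with sentence length and with the share of long words; because Polish words tend to be longer than English ones, a higher `complex_syllables` value may be more appropriate in practice.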
