Readability Classification for German using Lexical, Syntactic, and Morphological Features

We investigate the problem of reading level assessment for German texts on a newly compiled corpus of freely available easy and difficult articles, targeted at child and adult readers, respectively. We adapt a wide range of syntactic, lexical, and language model features from previous research on English and combine them with new features that make use of the rich morphology of German. We show that readability classification for German based on these features is highly successful, reaching 89.7% accuracy, with the new morphological features making an important contribution.
