Exploring Measures of "Readability" for Spoken Language: Analyzing linguistic features of subtitles to identify age-specific TV programs

We investigate whether measures of readability can be used to identify age-specific TV programs. Based on a corpus of BBC TV subtitles, we employ a range of linguistic readability features motivated by Second Language Acquisition and Psycholinguistics research. Our hypothesis that such readability features can successfully distinguish between spoken language targeting different age groups is fully confirmed. The classifiers we trained on the basis of these readability features achieve a classification accuracy of 95.9%. Investigating several feature subsets, we show that the authentic material targeting specific age groups exhibits a broad range of linguistic and psycholinguistic characteristics that are indicative of the complexity of the language used.
