Classifying Written Texts Through Rhythmic Features

Rhythm analysis of written texts has focused on literary analysis and has mainly considered poetry. In this paper we investigate the relevance of rhythmic features for categorizing prose texts belonging to different genres. Our contribution is threefold. First, we define a set of rhythmic features for written texts. Second, we extract these features from three corpora: speeches, essays, and newspaper articles. Third, we perform feature selection by means of statistical analyses and determine a subset of features that efficiently discriminates between the three genres. We find that, using as few as eight rhythmic features, documents can be assigned to the correct genre with an accuracy of around 80%, significantly higher than the 33% baseline that results from random assignment.
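The 33% baseline follows from having three genres: a document assigned to a genre uniformly at random is labeled correctly with probability 1/3, whatever the sizes of the corpora. The short sketch below checks this arithmetic. The function name `chance_accuracy` and the corpus sizes are hypothetical and used only for illustration; they are not taken from the paper.

```python
from fractions import Fraction


def chance_accuracy(class_counts):
    """Expected accuracy when each document gets a uniformly random label.

    class_counts: number of documents per genre (hypothetical values).
    """
    k = len(class_counts)
    total = sum(class_counts)
    # A document of any genre is labeled correctly with probability 1/k,
    # so the weighted average over genres is also 1/k.
    return sum(Fraction(n, total) * Fraction(1, k) for n in class_counts)


# Hypothetical corpus sizes for speeches, essays, and newspaper articles.
baseline = chance_accuracy([100, 100, 100])
print(f"random-assignment baseline: {float(baseline):.1%}")
```

The result does not depend on the class sizes, which is why the comparison only needs the number of genres.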
