Automatic recognition of speech, thought, and writing representation in German narrative texts

This article presents the main results of a project that explored ways to automatically recognize and classify speech, thought, and writing representation (ST&WR), a narrative feature, using surface information and methods of computational linguistics. The task was to detect and distinguish four types (direct, free indirect, indirect, and reported ST&WR) in a corpus of manually annotated German narrative texts. Rule-based and machine-learning methods were tested and compared. Results were best for direct ST&WR (best F1 score: 0.87), followed by indirect (0.71), reported (0.58), and finally free indirect ST&WR (0.40). The rule-based approach worked best for ST&WR types with clear patterns, such as indirect and marked direct ST&WR, and often gave the most accurate results. Machine learning was most successful for types without clear indicators, such as free indirect ST&WR, and proved more stable. For the percentage of ST&WR in a text, the results of the machine-learning methods always correlated best with those of manual annotation. Taking the union or the intersection of the two approaches' results did not lead to striking improvements. A stricter definition of ST&WR, which excluded borderline cases, made the task harder and led to worse results for both approaches.
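To make the comparison of the two approaches concrete, the following is a minimal sketch of how the union and intersection of two sets of predictions might be scored with F1 for a single ST&WR type. It is not the project's implementation: the unit of evaluation (one boolean label per text segment), the function names `f1_score` and `combine`, and the toy data are all assumptions introduced for illustration.

```python
def f1_score(gold, pred):
    """F1 for one ST&WR type; gold and pred are equal-length lists of booleans."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combine(rule_pred, ml_pred, mode):
    """Merge two prediction lists: 'union' favours recall, 'intersection' precision."""
    if mode == "union":
        return [r or m for r, m in zip(rule_pred, ml_pred)]
    if mode == "intersection":
        return [r and m for r, m in zip(rule_pred, ml_pred)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical toy data: one label per segment (True = segment contains the type).
gold = [True, True, False, False, True, False]
rule = [True, False, False, False, True, False]  # assumed rule-based output
ml = [True, True, True, False, False, False]     # assumed machine-learning output

for name, pred in [
    ("rule-based", rule),
    ("machine learning", ml),
    ("union", combine(rule, ml, "union")),
    ("intersection", combine(rule, ml, "intersection")),
]:
    print(f"{name}: F1 = {f1_score(gold, pred):.2f}")
```

The union can only add positive predictions and the intersection can only remove them, so each trades precision against recall rather than improving both. That is one plausible reading of why combining the approaches brought no striking gains, though the toy numbers here say nothing about the actual corpus.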
