Genre Classification on German Novels

The study of German literature is mostly based on literary canons, i.e., small sets of specifically chosen documents. In particular, the history of the novel has been characterized using a set of only 100 to 250 works. In this paper, we address genre classification on a large set of novels using machine learning methods in order to achieve a better understanding of the genre of novels. To this end, we explore how different types of features affect the performance of different classification algorithms. We employ commonly used stylometric features and evaluate two types of features not yet applied to genre classification: topic-based features, and features based on social network graphs and character interaction. We build these features on a data set of close to 1700 novels either written in or translated into German. Even though topics are often considered orthogonal to genres, we find that topic-based features in combination with support vector machines achieve the best results. Overall, we successfully apply new feature types to genre classification of novels and give directions for further research in this area.
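To make the idea of features based on social network graphs and character interaction more concrete, the following is a minimal sketch rather than the procedure used in the paper. It assumes that character mentions have already been detected and grouped by text segment, and that two characters appearing in the same segment count as interacting. The function names, the feature set (number of characters, graph density, highest degree centrality), and the toy segments are all hypothetical.

```python
from itertools import combinations


def interaction_graph(segments):
    """Build an undirected co-occurrence graph as an adjacency dict.

    Assumption: two characters interact if they share a segment.
    """
    neighbors = {}
    for names in segments:
        for name in names:
            # Register every character, including ones that appear alone.
            neighbors.setdefault(name, set())
        for a, b in combinations(sorted(set(names)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def network_features(segments):
    """Summarize the character network of one novel as numeric features."""
    graph = interaction_graph(segments)
    n = len(graph)
    if n < 2:
        return {"characters": n, "density": 0.0, "max_centrality": 0.0}
    # Each undirected edge is stored twice in the adjacency dict.
    edges = sum(len(adjacent) for adjacent in graph.values()) // 2
    # Degree centrality: share of the other characters a character is linked to.
    centrality = [len(adjacent) / (n - 1) for adjacent in graph.values()]
    return {
        "characters": n,
        "density": 2 * edges / (n * (n - 1)),
        "max_centrality": max(centrality),
    }


# Toy input: each set holds the (hypothetical) characters found in one segment.
segments = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"E"}]
features = network_features(segments)
```

A dictionary like `features` could be computed per novel and combined with stylometric or topic-based features before being passed to a classifier such as a support vector machine.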
