Paper information - Making topic modeling easy. A programming library in Python

Topic modeling, a method for the semantic analysis of large text collections, has been a focus of interest in digital literary studies in recent years. The method uses probabilistic procedures to generate probability distributions for words from a collection of texts, sorting the many individual word distributions into distinct semantic groups called ‘topics’. These topics constitute groups of semantically related words, and the contribution of each topic to the composition of each text can be quantified mathematically (Blei 2012, Steyvers and Griffiths 2006).

In digital literary studies, topic models can be interesting in themselves. For example, their dynamic development, either over the course of the plot of a single literary text or across multiple texts in a period of literary history, can be analyzed (Jockers 2013, Blevins 2012, Rhody 2012, Schöch to appear), though equating literary themes with the probabilistic concept of ‘topics’ described here is not unproblematic. Topic models can also serve as features for classifying or clustering texts (Blei 2012).

There are currently two state-of-the-art implementations of the relevant algorithms: ‘Mallet’ (McCallum 2002) and ‘Gensim’ (Rehurek 2010). But usually more is required than simply running a topic modeling algorithm (Fig. 1):

• Longer texts like novels need to be split into smaller parts (e.g. paragraphs, scenes, or a fixed number of characters or words).
• NLP-based preprocessing is necessary.
• To achieve optimal results, texts must be reduced to content words, either by filtering out function words with stopword lists or by using a part-of-speech tagger to exclude unwanted word classes.
• Similarly, lemmatization and elimination of proper names can be useful.
• After the topics have been generated, results are usually visualized based on the relevant metadata.
• Results need to be evaluated with regard to internal or external criteria rather than just being left to interpretation.
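The segmentation and stopword-filtering steps listed above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the API of the library described here; the function names (`tokenize`, `split_into_segments`, `remove_stopwords`, `preprocess`), the tiny English stopword set, and the default segment size are all assumptions made for illustration.

```python
import re

# Illustrative stopword set only (an assumption); a real pipeline would
# use a fuller, language-specific list or a part-of-speech tagger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "it", "is"}


def tokenize(text):
    """Lowercase the text and return each run of letters as a token."""
    return re.findall(r"[^\W\d_]+", text.lower())


def split_into_segments(tokens, segment_size):
    """Split tokens into consecutive segments of at most segment_size tokens."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    return [
        tokens[i:i + segment_size]
        for i in range(0, len(tokens), segment_size)
    ]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


def preprocess(text, segment_size=1000):
    """Tokenize a text, cut it into fixed-size segments, and filter each one."""
    tokens = tokenize(text)
    segments = split_into_segments(tokens, segment_size)
    return [remove_stopwords(segment) for segment in segments]
```

Each returned segment is a list of content-word tokens that could then be passed as one document to a topic modeling implementation such as Gensim or Mallet. Segmenting before filtering keeps segment boundaries tied to the original text length; filtering first would instead yield segments of equal content-word count.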

Fotis Jannidis | Christof Schöch | Thorsten Vitt | Steffen Pielström

[1] Matthew L. Jockers. Macroanalysis: Digital Methods and Literary History, 2013.

[2] Chong Wang, et al. Reading Tea Leaves: How Humans Interpret Topic Models, 2009, NIPS.

[3] David M. Blei, et al. Probabilistic topic models, 2012, Commun. ACM.

[4] Petr Sojka, et al. Software Framework for Topic Modelling with Large Corpora, 2010.

[5] Christof Schöch, et al. Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama, 2015, Digit. Humanit. Q.

[6] Lisa Rhody. Topic Modeling and Figurative Language, 2012.

[7] Ruslan Salakhutdinov, et al. Evaluation methods for topic models, 2009, ICML '09.