Combining Topic Models for Corpus Exploration: Applying LDA for Complex Corpus Research Tasks in a Digital Humanities Project

We investigate new ways of applying LDA topic models: rather than optimizing a single model for a specific use case, we train multiple models with different parameters and vocabularies and combine them on the fly to suit varying information retrieval tasks. We also present a semi-automatic method that helps users identify relevant topics across multiple models. Our methods are demonstrated and evaluated on a real-world use case: a large-scale, corpus-based digital humanities project called Welt der Kinder ("Children and their World"). We illustrate our approach in that context and show that it can be generalized to other scenarios. We evaluate this work using empirical methods from information retrieval, and also present visualizations and use cases as actually applied in the project.
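To make the idea of combining models on the fly more concrete, the following is a minimal sketch and not the project's actual implementation. It assumes that each trained model exposes a per-document topic distribution, that a user has already selected relevant topics in each model, and that documents are scored by the summed probability of the selected topics, averaged over models. The function name `rank_documents`, the data layout, and the scoring rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: doc_topics[model_name][doc_id] is that document's
# topic distribution (one probability per topic) under the given model.
DocTopics = Dict[str, Dict[str, List[float]]]


def rank_documents(
    doc_topics: DocTopics,
    selected: Dict[str, List[int]],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Rank documents by the topics a user selected in several models.

    Assumed scoring rule (not taken from the paper): within each model,
    sum the probabilities of the selected topics, then average that
    sum over all models that have at least one selected topic.
    """
    # Ignore models for which no topic was selected.
    active = {m: ts for m, ts in selected.items() if ts}
    if not active:
        return []

    scores: Dict[str, float] = {}
    for model_name, topic_ids in active.items():
        for doc_id, theta in doc_topics[model_name].items():
            model_score = sum(theta[t] for t in topic_ids)
            scores[doc_id] = scores.get(doc_id, 0.0) + model_score

    # Average over models; break ties by document id for stable output.
    ranked = sorted(
        ((doc_id, total / len(active)) for doc_id, total in scores.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:top_n]


# Toy data: two models with different numbers of topics.
doc_topics_example: DocTopics = {
    "model_a": {
        "doc1": [0.7, 0.2, 0.1],
        "doc2": [0.1, 0.8, 0.1],
        "doc3": [0.3, 0.3, 0.4],
    },
    "model_b": {
        "doc1": [0.9, 0.1],
        "doc2": [0.2, 0.8],
        "doc3": [0.5, 0.5],
    },
}

# The user marked topic 0 as relevant in both models.
ranking = rank_documents(doc_topics_example, {"model_a": [0], "model_b": [0]})
```

Averaging is only one possible combination rule; taking the maximum across models, or weighting models by vocabulary or topic count, would be equally plausible under the same assumptions.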
