Paper information - SLaTE: A System for Labeling Topics with Entities

SLaTE: A System for Labeling Topics with Entities

In recent years, the Latent Dirichlet Allocation (LDA) topic model (Blei, Ng, and Jordan, 2003) has become one of the most widely used text mining techniques (Meeks and Weingart, 2012) in the digital humanities (DH). Scholars have often noted its potential for text exploration and distant reading analyses, although its results are well known to be difficult to interpret (Chang et al., 2009) and to evaluate (Wallach et al., 2009). At last year’s edition of the Digital Humanities conference, we introduced a new corpus exploration method that produces topics which are easier to interpret and evaluate than standard LDA topic models (Nanni and Ruiz, 2016). We did so by combining two existing techniques, namely entity linking and Labeled LDA (L-LDA). At its heart, our method first identifies a collection of descriptive labels for the topics of arbitrary documents from a corpus, drawn from the vocabulary of entities found within wide-coverage knowledge resources (e.g., Wikipedia, DBpedia). It then generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier, and using a disambiguated knowledge resource as background knowledge limits label ambiguity. Because our topics are described with a limited number of unambiguous labels, they promote interpretability, and this may support the use of the results as quantitative evidence in humanities research (Lauscher et al., 2016). The contributions of this poster cover the release of: a) a complete implementation of the processing pipeline for our entity-based LDA approach; b) a three-step evaluation platform that enables its extensive quantitative analysis.
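The two-step idea described above (pick entity labels per document, then learn one topic per label) can be illustrated with a small sketch. This is not the released SLaTE implementation: the entity linker output is mocked by hand, the label-selection rule (`select_labels`, top-k by score) is a hypothetical stand-in, and `labeled_lda` is a toy collapsed Gibbs sampler with assumed hyperparameters `alpha` and `beta`.

```python
import random
from collections import defaultdict


def select_labels(entity_scores, top_k=2):
    """Hypothetical label selection: keep the top_k highest-scoring entities."""
    return sorted(entity_scores, key=lambda e: (-entity_scores[e], e))[:top_k]


def labeled_lda(docs, labels, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for Labeled LDA.

    Each token may only be assigned to a topic whose label is attached
    to its document. Returns {label: {word: probability}}.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    topic_word = defaultdict(lambda: defaultdict(int))  # count(label, word)
    topic_total = defaultdict(int)                      # count(label)
    doc_topic = [defaultdict(int) for _ in docs]        # count(doc, label)
    assign = []

    # Random initialisation, restricted to each document's own labels.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.choice(labels[d])
            z_d.append(z)
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assign.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample among the labels allowed for this document only.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in labels[d]
                ]
                z = rng.choices(labels[d], weights=weights)[0]
                assign[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1

    all_labels = sorted({k for ls in labels for k in ls})
    return {
        k: {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
            for w in vocab}
        for k in all_labels
    }


# Made-up corpus and made-up entity linking scores, for illustration only.
docs = [
    "bank loan interest rate bank credit".split(),
    "river bank water flood river".split(),
    "loan credit flood insurance water rate".split(),
]
linked = [
    {"Bank": 0.9, "Loan": 0.7, "Paris": 0.1},
    {"River": 0.8, "Flood": 0.6},
    {"Loan": 0.5, "Flood": 0.5, "Insurance": 0.4},
]
labels = [select_labels(scores, top_k=2) for scores in linked]
topics = labeled_lda(docs, labels)

for label, dist in topics.items():
    top_words = sorted(dist, key=dist.get, reverse=True)[:3]
    print(label, top_words)
```

Because every topic is tied to exactly one entity label, each printed word list can be read directly as "the words associated with this entity", which is the interpretability argument made above.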

Simone Paolo Ponzetto | Federico Nanni | Anne Lauscher

[1] Chong Wang, et al. Reading Tea Leaves: How Humans Interpret Topic Models, 2009, NIPS.

[2] Simone Paolo Ponzetto, et al. Entities as topic labels: combining entity linking and labeled LDA to improve topic interpretability and evaluability, 2016.

[3] Paolo Ferragina, et al. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities), 2010, CIKM.

[4] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2003, J. Mach. Learn. Res.

[5] Ruslan Salakhutdinov, et al. Evaluation methods for topic models, 2009, ICML '09.

[6] Federico Nanni, et al. Entities as topic labels: Improving topic interpretability and evaluability combining Entity Linking and Labeled LDA, 2016, DH.