Speeding Document Annotation with Topic Models

Document classification and topic models are useful tools for managing and understanding large corpora. Topic models uncover the underlying semantic structure of document collections. Categorizing large collections of documents requires hand-labeled training data, which is time-consuming to produce and requires human expertise. We believe that engaging users in the document labeling process helps reduce annotation time and address user needs. We present an interactive document labeling tool that uses topic models to assist users. Our preliminary results show that users can apply labels to documents more effectively and efficiently when given topic model information.
