Evaluating topic representations for exploring document collections

Topic models have been shown to be a useful way of representing the content of large document collections, for example, via visualization interfaces (topic browsers). These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is a term list, that is, the top-n words with the highest conditional probability within the topic. Other topic representations, such as textual and image labels, have also been proposed. However, there has been no comparison of these alternative representations. In this article, we compare three different topic representations in a document retrieval task. Participants were asked to retrieve relevant documents based on predefined queries within a fixed time limit, with topics presented in one of the following modalities: (a) lists of terms, (b) textual phrase labels, and (c) image labels. Results show that textual labels are easier for users to interpret than term lists and image labels. Moreover, the precision of documents retrieved using textual and image labels is comparable to the precision achieved when topics are represented by term lists, demonstrating that labeling methods are an effective alternative topic representation.
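The term-list representation described above can be illustrated with a minimal sketch. It assumes a topic is available as a mapping from words to conditional probabilities P(word | topic); the function name `top_n_terms`, the alphabetical tie-breaking rule, and the example probabilities are hypothetical and are not taken from the article.

```python
from typing import Dict, List


def top_n_terms(topic_word_probs: Dict[str, float], n: int = 10) -> List[str]:
    """Return the n words with the highest P(word | topic).

    Ties are broken alphabetically so the result is deterministic
    (an assumption of this sketch, not something the article specifies).
    """
    ranked = sorted(topic_word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical topic with made-up probabilities, for illustration only.
example_topic = {
    "market": 0.12,
    "stock": 0.10,
    "price": 0.09,
    "bank": 0.07,
    "trade": 0.05,
    "the": 0.01,
}

term_list = top_n_terms(example_topic, n=3)
print(term_list)  # ['market', 'stock', 'price']
```

In practice the probabilities would come from a fitted topic model, with one such mapping per topic.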
