Interpretation and trust: designing model-driven visualizations for text analysis

Statistical topic models can help analysts discover patterns in large text corpora by identifying recurring sets of words and enabling exploration by topical concepts. However, understanding and validating the output of these models can itself be a challenging analysis task. In this paper, we offer two design considerations, interpretation and trust, for designing visualizations based on data-driven models. Interpretation refers to the facility with which an analyst makes inferences about the data through the lens of a model abstraction. Trust refers to the actual and perceived accuracy of an analyst's inferences. These considerations derive from our experiences developing the Stanford Dissertation Browser, a tool for exploring over 9,000 Ph.D. theses by topical similarity, and a subsequent review of existing literature. We contribute a novel similarity measure for text collections based on a notion of "word-borrowing" that arose from an iterative design process. Drawing on these experiences and the literature review, we distill a set of design recommendations and describe how they promote interpretable and trustworthy visual analysis tools.
