TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling

We present TopicNets, a Web-based system for visual and interactive analysis of large sets of documents using statistical topic models. We describe a range of visualization types and control mechanisms that support knowledge discovery, including corpus- and document-specific views, iterative topic modeling, search, and visual filtering. Drill-down functionality lets analysts visualize individual document sections and their relations within the global topic space. Analysts can search across a dataset through a set of expansion techniques applied to selected document and topic nodes. They can also select relevant subsets of documents and run real-time topic modeling on those subsets, interactively visualizing topics at various levels of granularity to better understand the documents. We discuss the design and implementation choices behind each visual analysis technique, followed by three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find: a corpus of 50,000 successful NSF grant proposals, 10,000 publications from a large research center, and single documents, including a grant proposal and a PhD thesis.
