Structure-Aware Visualization of Text Corpora

Comprehending the structure and content of large text corpora can be a daunting and often time-consuming task. In this paper, we introduce a novel tool that exploits the structural properties of a corpus to extract and visualize the underlying topics in a given dataset. To this end, we combine latent topic analysis, discriminative feature selection applied on top of the category structure of corpora, and various ranking methods to extract the most representative topics for a given corpus. The visual representation used to depict the outcome of these methods can be chosen based on the context. Such visual representations can be useful for depicting trends, identifying "hot" topics, and discovering interesting patterns in the underlying data. As applications, we create example representations for a variety of corpora obtained from conference proceedings, movie summaries, and newsgroup postings. Our user experiments, conducted with a flower-like visualization inspired by the "wheel of emotion", demonstrate the viability of our approach for generating high-quality representative topics and for unearthing hidden structures and connections in large document corpora.
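To make the idea of discriminative feature selection over a category structure concrete, the following is a minimal sketch, not the method used in the paper. The abstract does not name a selection measure; the chi-square statistic is assumed here as one common choice, and the function name `chi_square_top_terms`, the whitespace tokenizer, and the toy corpus are all hypothetical.

```python
from collections import Counter


def chi_square_top_terms(docs, k=3):
    """Rank terms per category by a chi-square score (assumed measure).

    docs: list of (category, text) pairs.
    Returns {category: [up to k most discriminative terms]}.
    """
    n = len(docs)
    cat_count = Counter(cat for cat, _ in docs)

    # Document frequency of each term, overall and within each category.
    df = Counter()
    df_cat = {cat: Counter() for cat in cat_count}
    for cat, text in docs:
        terms = set(text.lower().split())  # naive whitespace tokenizer
        df.update(terms)
        df_cat[cat].update(terms)

    result = {}
    for cat, n_cat in cat_count.items():
        scores = {}
        for term, n_term in df.items():
            a = df_cat[cat][term]  # docs in category containing the term
            b = n_term - a         # docs outside category containing the term
            c = n_cat - a          # docs in category lacking the term
            d = n - n_cat - b      # docs outside category lacking the term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            # Skip degenerate cells and terms not positively associated
            # with the category.
            if denom == 0 or a * d <= b * c:
                continue
            scores[term] = n * (a * d - b * c) ** 2 / denom
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cat] = ranked[:k]
    return result


# Toy corpus with two categories (illustrative data only).
docs = [
    ("movies", "the film has a great plot"),
    ("movies", "the actor carries the film"),
    ("sports", "the team won the match"),
    ("sports", "the team lost a close match"),
]
top = chi_square_top_terms(docs, k=3)
print(top)
```

In this sketch, terms that occur in every document (such as "the") or equally often inside and outside a category receive no score, so only terms that separate one category from the rest are ranked. A full pipeline along the lines described above would combine such scores with latent topic analysis and further ranking steps.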
