Semantic visualization for spherical representation

Visualization of high-dimensional data such as text documents is widely applicable. The traditional means is to find an appropriate embedding of the high-dimensional representation in a low-dimensional visualizable space. As topic modeling is a useful form of dimensionality reduction that preserves the semantics in documents, recent approaches aim for a visualization that is consistent with both the original word space, as well as the semantic topic space. In this paper, we address the semantic visualization problem. Given a corpus of documents, the objective is to simultaneously learn the topic distributions as well as the visualization coordinates of documents. We propose to develop a semantic visualization model that approximates L2-normalized data directly. The key is to associate each document with three representations: a coordinate in the visualization space, a multinomial distribution in the topic space, and a directional vector in a high-dimensional unit hypersphere in the word space. We join these representations in a unified generative model, and describe its parameter estimation through variational inference. Comprehensive experiments on real-life text datasets show that the proposed method outperforms the existing baselines on objective evaluation metrics for visualization quality and topic interpretability.

[1]  Pierre Comon Independent component analysis - a new concept? signal processing , 1994 .

[2]  K. Mardia Distribution Theory for the Von Mises-Fisher Distribution and Its Application , 1975 .

[3]  William Ribarsky,et al.  ParallelTopics: A probabilistic approach to exploring document collections , 2011, 2011 IEEE Conference on Visual Analytics Science and Technology (VAST).

[4]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[5]  Gilbert L. Peterson,et al.  Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps , 2009, FLAIRS.

[6]  J. Kruskal Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis , 1964 .

[7]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[8]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[9]  P. Comon Independent Component Analysis , 1992 .

[10]  S. R. Jammalamadaka,et al.  Directional Statistics, I , 2011 .

[11]  David M. Blei,et al.  Visualizing Topic Models , 2012, ICWSM.

[12]  David Newman,et al.  External evaluation of topic models , 2009 .

[13]  Timothy Baldwin,et al.  Automatic Evaluation of Topic Coherence , 2010, NAACL.

[14]  Nebojsa Jojic,et al.  Documents as multiple overlapping windows into grids of counts , 2013, NIPS.

[15]  Inderjit S. Dhillon,et al.  Clustering on the Unit Hypersphere using von Mises-Fisher Distributions , 2005, J. Mach. Learn. Res..

[16]  Christopher M. Bishop,et al.  GTM: The Generative Topographic Mapping , 1998, Neural Computation.

[17]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[18]  J. Tenenbaum,et al.  A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[19]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[20]  Padhraic Smyth,et al.  TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling , 2012, TIST.

[21]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[22]  Martin D. Buhmann,et al.  Radial Basis Functions , 2021, Encyclopedia of Mathematical Geosciences.

[23]  Geoffrey E. Hinton,et al.  Stochastic Neighbor Embedding , 2002, NIPS.

[24]  Hady Wirawan Lauw,et al.  Manifold Learning for Jointly Modeling Topic and Visualization , 2014, AAAI.

[25]  Thomas L. Griffiths,et al.  Parametric Embedding for Class Visualization , 2004, Neural Computation.

[26]  Ian T. Jolliffe,et al.  Principal Component Analysis , 2002, International Encyclopedia of Statistical Science.

[27]  Bryan Silverthorn,et al.  Spherical Topic Models , 2010, ICML.

[28]  Ana Margarida de Jesus,et al.  Improving Methods for Single-label Text Categorization , 2007 .

[29]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[30]  Qiang Zhang,et al.  TIARA: a visual exploratory text analytic system , 2010, KDD '10.

[31]  Pierre Comon,et al.  Independent component analysis, A new concept? , 1994, Signal Process..

[32]  Naonori Ueda,et al.  Probabilistic latent semantic visualization: topic model for visualizing documents , 2008, KDD.

[33]  Jeffrey Heer,et al.  Termite: visualization techniques for assessing textual topic models , 2012, AVI.

[34]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[35]  Teuvo Kohonen,et al.  The self-organizing map , 1990 .

[36]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.