Text Network Exploration via Heterogeneous Web of Topics

A text network is a data type in which each vertex is associated with a text document and edges represent the relationships between documents. The proliferation of text networks, such as hyperlinked webpages and academic citation networks, has increased the demand for quickly developing a general sense of a new text network, a task we refer to as text network exploration. In this paper, we address text network exploration by constructing a heterogeneous web of topics, which allows people to investigate a text network in a way that connects the word level with the document level. To achieve this, we propose a probabilistic generative model for text and links that quantifies three different relationships in the heterogeneous topic web. We also develop a prototype demo system, TopicAtlas, to exhibit this heterogeneous topic web and demonstrate how the system can facilitate text network exploration. Extensive qualitative analyses are included to verify the effectiveness of the heterogeneous topic web. In addition, we validate our model on real-life text networks, showing that it maintains good performance on objective evaluation metrics.
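As a purely illustrative aid, the definition of a text network given above (documents on vertices, relationships on edges) can be sketched as a small data structure. The class name `TextNetwork` and its methods are hypothetical and are not taken from the paper or from TopicAtlas; the sketch covers only the input data type, not the proposed generative model or the heterogeneous topic web.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class TextNetwork:
    """Hypothetical container: each vertex holds a document, each edge links two documents."""

    docs: Dict[str, str] = field(default_factory=dict)          # vertex id -> document text
    edges: Set[Tuple[str, str]] = field(default_factory=set)    # directed (source, target) pairs

    def add_document(self, doc_id: str, text: str) -> None:
        # Register a vertex together with its associated text.
        self.docs[doc_id] = text

    def add_link(self, src: str, dst: str) -> None:
        # A link (e.g. a hyperlink or a citation) must connect two known documents.
        if src not in self.docs or dst not in self.docs:
            raise KeyError("both endpoints must be existing documents")
        self.edges.add((src, dst))

    def neighbors(self, doc_id: str) -> List[str]:
        # Documents that doc_id links to, in sorted order for determinism.
        return sorted(dst for src, dst in self.edges if src == doc_id)


# Example: two papers where "a" cites "b".
net = TextNetwork()
net.add_document("a", "topic models for hyperlinked text")
net.add_document("b", "probabilistic latent semantic indexing")
net.add_link("a", "b")
```

Directed edges are assumed here because hyperlinks and citations both have a direction; an undirected network could store each pair in both orders.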
