Visualising Large Document Collections by Jointly Modeling Text and Network Structure

Many large text collections exhibit graph structures, either inherent to the content itself or encoded in the metadata of the individual documents. Examples of graphs extracted from document collections include co-author networks, citation networks, and named-entity co-occurrence networks. Furthermore, social networks can be extracted from email corpora, tweets, or other social media. When it comes to visualising these large corpora, either the textual content or the network graph is used. In this paper, we propose to incorporate both text and graph, so as to visualise not only the semantic information encoded in the documents' content but also the relationships expressed by the inherent network structure. To this end, we introduce a novel algorithm based on multi-objective optimisation to jointly position embedded documents and graph nodes in a two-dimensional landscape. We illustrate the effectiveness of our approach with real-world datasets and show that we can capture the semantics of large document collections better than other visualisations based on either the content or the network information.
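
To make the idea of jointly positioning documents from text and graph information concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes one common way of handling two objectives, a weighted sum: a text term that asks layout distances to match distances between document embeddings, and a graph term that pulls linked documents together. The function name `joint_layout` and the parameters `alpha`, `steps`, and `lr` are hypothetical and chosen for illustration only.

```python
import math
import random


def joint_layout(doc_vecs, edges, alpha=0.5, steps=500, lr=0.01, seed=0):
    """Place documents in 2D by gradient descent on a weighted sum of two terms.

    Assumption (not from the paper): the two objectives are combined as
    alpha * text_term + (1 - alpha) * graph_term.
    """
    rng = random.Random(seed)
    n = len(doc_vecs)
    # Target distances: Euclidean distances between the text embeddings.
    target = [[math.dist(a, b) for b in doc_vecs] for a in doc_vecs]
    # Small random starting positions in the plane.
    pos = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(n)]

    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in range(n)]

        # Text term: sum over pairs of (layout distance - target distance)^2.
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)  # guard against division by zero
                coef = alpha * 2.0 * (dist - target[i][j]) / dist
                grad[i][0] += coef * dx
                grad[i][1] += coef * dy
                grad[j][0] -= coef * dx
                grad[j][1] -= coef * dy

        # Graph term: sum over edges of squared layout distance (attraction).
        for i, j in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            coef = (1.0 - alpha) * 2.0
            grad[i][0] += coef * dx
            grad[i][1] += coef * dy
            grad[j][0] -= coef * dx
            grad[j][1] -= coef * dy

        # Plain gradient descent step on the combined objective.
        for i in range(n):
            pos[i][0] -= lr * grad[i][0]
            pos[i][1] -= lr * grad[i][1]

    return pos


# Toy input: four documents with 3-dimensional embeddings and two links.
docs = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0], [5.1, 0.0, 0.0]]
links = [(0, 1), (2, 3)]
layout = joint_layout(docs, links, alpha=0.5)
```

In this sketch, `alpha=1.0` reduces to a purely text-based layout and `alpha=0.0` to a purely graph-based one. The graph term on its own only attracts, so linked documents would collapse onto a single point; here it is the text term that keeps the layout spread out.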
