TRIVIR: A Visualization System to Support Document Retrieval with High Recall

In this paper, we propose TRIVIR, a novel interactive visualization tool powered by an Information Retrieval (IR) engine that implements an active learning protocol to support IR with high recall. The system integrates multiple graphical views to help the user identify the relevant documents in a collection, including a content-based similarity map obtained with multidimensional projection techniques. Given representative documents as queries, users can interact with the views to label documents as relevant or not relevant, and this information is used to train a machine learning (ML) algorithm that suggests other potentially relevant documents on demand. TRIVIR offers two major advantages over existing visualization systems for IR. First, it merges the output of the ML algorithm into the visualization and supports several user interactions that enhance and speed up the algorithm's convergence. Second, it tackles the problem of vocabulary mismatch by providing term synonyms and a view that conveys how the terms are used within the collection. In addition, TRIVIR has been developed as a flexible front-end interface that can be associated with distinct text representations and multidimensional projection techniques. We describe two use cases conducted with collaborators who are potential users of TRIVIR. Results show that the system simplified the search for relevant documents in large collections, based on the context in which the terms occur.
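
The labeling loop described above can be illustrated with a minimal sketch. It is not TRIVIR's implementation: the abstract does not name the ML algorithm, so the sketch substitutes a simple centroid-based scorer over bag-of-words vectors, and the function names (`vectorize`, `cosine`, `centroid`, `suggest`) and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of term-count vectors (the scale does not affect cosine)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def suggest(docs, labels, k=2):
    """Rank unlabeled documents by similarity to the relevant set minus
    similarity to the not-relevant set (an assumed stand-in scorer)."""
    vecs = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    pos = centroid(vecs[d] for d, rel in labels.items() if rel)
    neg = centroid(vecs[d] for d, rel in labels.items() if not rel)
    scores = {
        d: cosine(v, pos) - cosine(v, neg)
        for d, v in vecs.items() if d not in labels
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "visual analytics for document retrieval",
    "d2": "interactive document retrieval with relevance feedback",
    "d3": "protein folding simulation on gpus",
    "d4": "relevance feedback in interactive search",
    "d5": "gpu kernels for molecular simulation",
}
labels = {"d1": True, "d3": False}  # seed query document plus one rejection
first = suggest(docs, labels, k=1)  # system suggests a candidate
labels[first[0]] = True             # user confirms it as relevant
second = suggest(docs, labels, k=1) # suggestions update with the new label
```

Each confirmed label changes the relevant-set centroid, so the second round can surface a document that shares no terms with the original seed. This is the general behavior an active learning protocol aims for when recall matters; the scorer itself is only a placeholder.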
