Normalized compression distance for visual analysis of document collections

In a world flooded with text from many sources, it is strategically important to map the information in written documents into a form that helps users locate and associate relevant content within a text data set. Content-based maps can support very useful explorations of such data sets. This paper proposes and evaluates the use of Kolmogorov complexity approximations to detect similarity between general textual documents, in order to support mapping and visualization techniques for corpus exploration. Calculating this similarity measure requires no intermediate representation of the corpus (such as a vector representation) and therefore no pre-processing or parametrization steps. That makes it attractive for a wider range of exploratory applications than conventional measures that need vector-based text representations. The visual layout used here is based on fast distance-based multi-dimensional projections. The results show that the similarity measure and the resulting maps achieve very good precision and that the approach can be used successfully for visual analysis of automatically generated text maps.
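As an illustration of the similarity measure named in the title, the sketch below computes the normalized compression distance in its usual form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length of its argument. The abstract does not say which compressor was used, so the choice of zlib here is an assumption, and the function names `compressed_size`, `ncd`, and `ncd_matrix` are hypothetical helpers introduced only for this example.

```python
import zlib
from typing import List, Sequence


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input.

    This stands in for the (uncomputable) Kolmogorov complexity;
    zlib is an assumed compressor, not necessarily the paper's choice.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(docs: Sequence[bytes]) -> List[List[float]]:
    """Pairwise distance matrix over a small collection of documents.

    The diagonal is set to 0 and the upper triangle is mirrored, so the
    result is symmetric by construction even though a real compressor
    may give slightly different sizes for x + y and y + x.
    """
    n = len(docs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = ncd(docs[i], docs[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist
```

A matrix of this kind is the only input a distance-based projection needs, which is why no vector representation of the documents is required. With a real compressor the values are only approximate: very short documents are dominated by format overhead, and zlib's 32 KB window limits how much of a long first document can be reused when compressing the second.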
