Exploration of dimensionality reduction for text visualization

In the text document visualization community, statistical analysis tools (e.g., principal component analysis and multidimensional scaling) and neurocomputation models (e.g., self-organizing feature maps) have been widely used for dimensionality reduction. Often the resulting dimensionality is set to two, as this facilitates plotting the results. The validity and effectiveness of these approaches largely depend on the specific data sets used and semantics of the targeted applications. To date, there has been little evaluation to assess and compare dimensionality reduction methods and dimensionality reduction processes, either numerically or empirically. The focus of this paper is to propose a mechanism for comparing and evaluating the effectiveness of dimensionality reduction techniques in the visual exploration of text document archives. We use multivariate visualization techniques and interactive visual exploration to study three problems: (a) Which dimensionality reduction technique best preserves the interrelationships within a set of text documents; (b) What is the sensitivity of the results to the number of output dimensions; (c) Can we automatically remove redundant or unimportant words from the vector extracted from the documents while still preserving the majority of information, and thus make dimensionality reduction more efficient. To study each problem, we generate supplemental dimensions based on several dimensionality reduction algorithms and parameters controlling these algorithms. We then visually analyze and explore the characteristics of the reduced dimensional spaces as implemented within a linked, multiview multidimensional visual exploration tool, XmdvTool. We compare the derived dimensions to features known to be present in the original data. Quantitative measures are also used in identifying the quality of results using different numbers of output dimensions.

[1]  Timo Honkela,et al.  Very Large Two-Level SOM for the Browsing of Newsgroups , 1996, ICANN.

[2]  Timo Honkela,et al.  WEBSOM - Self-organizing maps of document collections , 1998, Neurocomputing.

[3]  Matthew O. Ward,et al.  N-land: a graphical tool for ex-ploring n-dimensional data , 1996 .

[4]  George Karypis,et al.  Concept Indexing: A Fast Dimensionality Reduction Algorithm With Applications to Document Retrieval and Categorization , 2000 .

[5]  Antoine Choppin,et al.  Unsupervised Classification of High Dimensional Data by means of Self-Organizing Neural Networks , 1998 .

[6]  Arthur Flexer,et al.  On the use of self-organizing maps for clustering and visualization , 1999, Intell. Data Anal..

[7]  Stefan Berchtold,et al.  Similarity clustering of dimensions for an enhanced visualization of multidimensional data , 1998, Proceedings IEEE Symposium on Information Visualization (Cat. No.98TB100258).

[8]  Christopher J. Fox,et al.  Strength and similarity of affix removal stemming algorithms , 2003, SIGF.

[9]  Matthew O. Ward,et al.  Interactive hierarchical displays: a general framework for visualization and exploration of large multivariate data sets , 2003, Comput. Graph..

[10]  Timo Honkela,et al.  Self-Organizing Maps of Very Large Document Collections: Justification for the WEBSOM Method , 1998 .

[11]  Matthew O. Ward,et al.  InterRing: an interactive tool for visually navigating and manipulating hierarchical structures , 2002, IEEE Symposium on Information Visualization, 2002. INFOVIS 2002..

[12]  William B. Frakes,et al.  Stemming Algorithms , 1992, Information Retrieval: Data Structures & Algorithms.

[13]  James A. Wise,et al.  The Ecological Approach to Text Visualization , 1999, J. Am. Soc. Inf. Sci..

[14]  Matthew O. Ward,et al.  XmdvTool: integrating multiple methods for visualizing multivariate data , 1994, Proceedings Visualization '94.

[15]  T. Kohonen,et al.  Self-organizing semantic maps , 1989, Biological Cybernetics.

[16]  Ryen W. White,et al.  Finding relevant documents using top ranking sentences: an evaluation of two alternative schemes , 2002, SIGIR '02.

[17]  Gary Marchionini,et al.  A self-organizing semantic map for information retrieval , 1991, SIGIR '91.

[18]  James J. Thomas,et al.  Visualizing the non-visual: spatial analysis and interaction with information from text documents , 1995, Proceedings of Visualization 1995 Conference.

[19]  Naftali Tishby,et al.  Sufficient Dimensionality Reduction - A novel Analysis Method , 2002, ICML.

[20]  Samuel Kaski,et al.  Dimensionality reduction by random mapping: fast similarity computation for clustering , 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[21]  Ingrid Renz,et al.  Daimler Benz Research: System and Experiments Routing and Filtering , 1997, TREC.

[22]  Pak Chung Wong,et al.  Human—Computer Interaction with Global Information Spaces — Beyond Data Mining , 2000 .

[23]  Teuvo Kohonen,et al.  The self-organizing map , 1990 .

[24]  Chris H. Q. Ding,et al.  Adaptive dimension reduction for clustering high dimensional data , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[25]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[26]  Lucy T. Nowell,et al.  ThemeRiver: Visualizing Thematic Changes in Large Document Collections , 2002, IEEE Trans. Vis. Comput. Graph..

[27]  Wojciech Basalaj Proximity visualisation of abstract data , 2001 .

[28]  Chris Buckley,et al.  Implementation of the SMART Information Retrieval System , 1985 .

[29]  Matthew O. Ward,et al.  High Dimensional Brushing for Interactive Exploration of Multivariate Data , 1995, Proceedings Visualization '95.

[30]  Jeanny Hérault,et al.  Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets , 1997, IEEE Trans. Neural Networks.

[31]  Matthew O. Ward,et al.  Visual Hierarchical Dimension Reduction for Exploration of High Dimensional Datasets , 2003, VisSym.

[32]  Arthur Flexer On the use of self-organizing maps for clustering and visualization , 2001 .

[33]  Timo Honkela,et al.  Exploration of full-text databases with self-organizing maps , 1996, Proceedings of International Conference on Neural Networks (ICNN'96).