Context-driven Dimensionality Reduction for Clustering Text Documents

We investigate clustering documents on the basis of automatically annotated, potentially sensitive information extracted from a large collection of organizational data. In this use case, clustering helps users visualize and navigate groups of documents with related content. However, both the effectiveness and the efficiency of document clustering are limited, mainly because of the high dimensionality of the document vectors. To alleviate this problem, we propose a dimensionality reduction approach that selects terms with high tf-idf scores from the context of the automatically annotated sensitive regions of a document. Because real organizational data is unavailable for research purposes, we evaluate our approach on the standard 20 Newsgroups dataset. For evaluation, the only sensitive information that we use from the documents of this dataset is the named entities, e.g. the names of persons and organizations. Experimental results show that our approach achieves an almost perfect clustering, with a purity of 0.998, a relative improvement of 22.60% over the purity of 0.814 obtained without dimensionality reduction.
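The selection step and the purity measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the token-window definition of "context", the `window` and `k` parameters, the inclusion of the entity tokens themselves, and the unsmoothed tf-idf formula are all assumptions made for illustration. Entity positions are taken as given, standing in for the output of a named entity annotator.

```python
import math
from collections import Counter


def context_tokens(tokens, entity_positions, window=2):
    """Keep tokens within `window` positions of any annotated entity token.

    Assumption: "context" is a symmetric token window that also includes
    the entity tokens themselves.
    """
    keep = set()
    for pos in entity_positions:
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        keep.update(range(lo, hi))
    return [tokens[i] for i in sorted(keep)]


def top_tfidf_terms(docs, k):
    """Return, for each document, its k highest-scoring tf-idf terms.

    Uses raw term frequency times log(N / df); ties are broken
    alphabetically so the result is deterministic.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    selected = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:k])
    return selected


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100
```

In this sketch, each document would first be reduced with `context_tokens`, the reduced documents would be passed to `top_tfidf_terms` to fix the retained vocabulary, and the resulting low-dimensional vectors would be handed to any clustering algorithm. Calling `relative_improvement(0.998, 0.814)` gives approximately 22.60, matching the figure reported above.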
