A New Semi-supervised Dimension Reduction Technique for Textual Data Analysis

Dimension reduction techniques are important preprocessing algorithms for high-dimensional applications: they reduce noise while preserving the main structure of the dataset. They have been successfully applied to a wide variety of problems, particularly in text mining. However, the algorithms proposed in the literature often suffer from low discriminant power due to their unsupervised nature and to the ‘curse of dimensionality’. Fortunately, several search engines such as Yahoo provide a manually created classification of a subset of documents that may be exploited to overcome this problem. In this paper we propose a semi-supervised version of a PCA-like algorithm for textual data analysis. The new method reduces the dimensionality of the term space by taking advantage of this document classification. The proposed algorithm has been evaluated on a text mining problem, where it outperforms well-known unsupervised techniques.
