Classifying web documents using term spectral transforms and Multi-Dimensional Latent Semantic representation

This research investigates document semantic representations that consider both term frequencies and term associations. In particular, we propose a general framework that uses term spectra to represent the spatial distribution of terms, and the associations among them, throughout a document. We explore three typical techniques for computing term spectra: the Discrete Cosine Transform (DCT), the Discrete Fourier Transform (DFT), and the Discrete Wavelet Transform (DWT). A term affinity graph is established to represent each document. We then employ a new document analysis method, recently developed by the authors, named Multi-Dimensional Latent Semantic Analysis (MDLSA), which lets us formulate an efficient semantic representation of a document based on its term affinity graph. We evaluate the algorithm on Web document classification. Experimental results demonstrate that the proposed technique is not only much more computationally efficient than Direct Graph Matching (DGM), but also outperforms state-of-the-art algorithms such as VSM, PCA, RAP, and MLM.
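To make the idea of a term spectrum concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a term's spatial distribution can be summarized by counting its occurrences in equal-width position bins, and that the spectrum is the magnitude of the DFT of that signal. The function names `term_signal` and `term_spectrum`, the bin count, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import numpy as np

def term_signal(tokens, term, n_bins=8):
    """Count occurrences of `term` in each of `n_bins` equal-width position bins.

    Assumption: binning by token position is one plausible way to capture
    where a term appears in a document; the abstract does not specify this.
    """
    signal = np.zeros(n_bins)
    n = len(tokens)
    for pos, tok in enumerate(tokens):
        if tok == term:
            # Map token position to a bin index in [0, n_bins - 1].
            signal[pos * n_bins // n] += 1
    return signal

def term_spectrum(signal):
    """Return the magnitude of the one-sided DFT of a term signal."""
    return np.abs(np.fft.rfft(signal))

# Hypothetical toy document.
tokens = "web page classification uses web term spectra for web documents".split()
sig = term_signal(tokens, "web", n_bins=5)
spec = term_spectrum(sig)
```

In this sketch the zero-frequency component `spec[0]` equals the plain term frequency, while the remaining components depend on where in the document the term occurs. This is the sense in which a spectrum can carry both frequency and positional information. A DCT or DWT could be substituted for the DFT in `term_spectrum` under the same binning assumption.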
