Support vector machine for customized email filtering based on improving latent semantic indexing

Latent semantic indexing (LSI) is an important method in information retrieval (IR). It automatically maps the original textual data to a smaller semantic space by taking advantage of the implicit, or latent, higher-order structure in the associations between words and customized objects, and it has also been applied successfully to text classification. LSI can mitigate the problems of polysemy and synonymy and can reduce noise in the raw document-term matrix. However, LSI is not an optimal approach to text classification: because it is a completely unsupervised method that ignores category discrimination, it often degrades classification performance when applied to the whole set of training documents. In this paper, to curb the spread of unsolicited email and harmful messages in a multilingual (Chinese and English) setting, we develop a system that filters email according to customized topics. We represent each topic in a latent semantic model and use LSI to extract features from predefined email categories and document categories. The system can recognize and filter customized or otherwise unwanted Chinese and English emails using a supervised learning approach based on positive examples. We propose an improved LSI that raises classification performance by performing a separate singular value decomposition (SVD) on the transformed local region of each category. We then apply a support vector machine (SVM) text classifier to recognize and filter email. Experimental results show that the approach is effective and achieves good filtering performance.
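
The idea of a separate SVD per category can be illustrated with a minimal sketch. It is not the system described above: the toy term-by-document matrix, the labels "spam" and "ham", and the helper names `lsi_basis`, `fold_in`, and `local_lsi` are all hypothetical. The sketch also assumes that the "local region" of a category is simply the set of training documents labeled with that category, and it uses the standard LSI fold-in projection; the abstract does not specify either choice.

```python
import numpy as np

def lsi_basis(term_doc, k):
    """Truncated SVD of a term-by-document matrix; returns (U_k, s_k)."""
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(doc_vec, U_k, s_k):
    """Project a term vector into the k-dimensional latent space."""
    return (U_k.T @ doc_vec) / s_k

def local_lsi(term_doc, labels, k):
    """One SVD per category, computed only on that category's documents.

    Assumption: the local region is the set of documents with that label.
    """
    return {c: lsi_basis(term_doc[:, labels == c], k) for c in np.unique(labels)}

# Toy data (hypothetical): rows are terms, columns are documents.
# Terms: offer, free, winner, meeting, report, agenda
A = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)
labels = np.array(["spam", "spam", "spam", "ham", "ham", "ham"])

bases = local_lsi(A, labels, k=1)

# A new message containing "offer" and "free", projected into each local space.
query = np.array([1, 1, 0, 0, 0, 0], dtype=float)
features = {c: fold_in(query, U_k, s_k) for c, (U_k, s_k) in bases.items()}

for category, vec in features.items():
    print(category, np.abs(vec))
```

In this toy setting the message has a nonzero coordinate only in the latent space built from the "spam" documents. In a full pipeline, latent features of this kind would be passed to an SVM classifier (for example `sklearn.svm.SVC` in scikit-learn); that step is omitted here.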
