Fast text categorization using concise semantic analysis

Text representation is a necessary step in text categorization. Bag of words (BOW) is currently the most widely used representation, but it suffers from two drawbacks: the vocabulary is very large, and relationships between words cannot feasibly be computed. Semantic analysis (SA) techniques help BOW overcome both drawbacks by interpreting words and documents in a space of concepts. However, existing SA techniques are not designed for text categorization and often incur a high computational cost. This paper proposes a concise semantic analysis (CSA) technique for text categorization. CSA extracts a few concepts from the category labels and then interprets words and documents concisely in terms of those concepts. The concepts are few in number, highly general, and closely tied to the category labels, so CSA preserves the information classifiers need at very low computational cost. To evaluate CSA, experiments were conducted on three data sets (Reuters-21578, 20-NewsGroup and Tancorp). The results show that CSA achieves micro- and macro-F1 performance comparable to, if not better than, that of BOW. The experiments also show that CSA helps dimension-sensitive learning algorithms such as k-nearest neighbor (kNN) overcome the "curse of dimensionality", so that they reach performance comparable to that of the support vector machine (SVM) in text categorization applications. In addition, CSA is language independent and performs equally well in Chinese and English.
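
The idea of interpreting words and documents in a small concept space derived from category labels can be illustrated with a short sketch. The abstract does not give the weighting formulas, so everything below is an assumption made for illustration: one concept per category, word-to-concept weights built from length-normalised term frequencies, L2 normalisation of word vectors, and a document vector formed as the count-weighted sum of its word vectors. The function names `fit_word_concepts` and `interpret_document` are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
import math


def fit_word_concepts(docs, labels):
    """Build one vector per word over the categories.

    Assumption: each category label acts as a single concept, and a word's
    weight for a concept is its summed, length-normalised term frequency
    in the training documents of that category.
    """
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    raw = defaultdict(lambda: [0.0] * len(categories))
    for tokens, label in zip(docs, labels):
        counts = Counter(tokens)
        total = sum(counts.values())
        for word, n in counts.items():
            raw[word][index[label]] += n / total
    # L2-normalise each word vector so frequent words do not dominate.
    word_vecs = {}
    for word, vec in raw.items():
        norm = math.sqrt(sum(v * v for v in vec))
        word_vecs[word] = [v / norm for v in vec]
    return categories, word_vecs


def interpret_document(tokens, word_vecs, n_concepts):
    """Represent a document as the count-weighted sum of its word vectors."""
    doc_vec = [0.0] * n_concepts
    for word, n in Counter(tokens).items():
        vec = word_vecs.get(word)
        if vec is None:  # words unseen in training are skipped
            continue
        for i, v in enumerate(vec):
            doc_vec[i] += n * v
    return doc_vec


# Toy corpus, invented purely for illustration.
docs = [
    ["stock", "market", "rises"],
    ["market", "crash", "fears"],
    ["team", "wins", "match"],
    ["match", "team", "loses"],
]
labels = ["finance", "finance", "sport", "sport"]

categories, word_vecs = fit_word_concepts(docs, labels)
doc_vec = interpret_document(["market", "team", "team"], word_vecs, len(categories))
print(categories, doc_vec)
```

In this sketch the document vector has one dimension per category rather than one per vocabulary word, which is why a distance-based classifier such as kNN would operate in a much smaller space than with BOW.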
