A Novel Kernel for Text Classification Based on Semantic and Statistical Information

In text categorization, a document is usually represented with the vector space model. This representation is sufficient for the classification task itself, but it cannot handle synonymy and polysemy in Chinese. This paper presents a novel approach that uses both semantic and statistical information to improve the accuracy of text classification. The approach derives semantic information from HowNet and statistical information from a kernel function with class-based weighting. In our experiments, the proposed approach achieves state-of-the-art or competitive results compared with traditional approaches such as k-Nearest Neighbor (KNN) and Naive Bayes, and with deep learning models such as convolutional networks.
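The abstract does not give the kernel's formula, so the following is only a minimal sketch of how semantic and statistical information could be combined in one kernel, not the paper's actual method. It assumes a term-by-term similarity matrix `sem_sim` (a hand-written placeholder for similarities that would be derived from HowNet), an inverse-class-frequency weight as one possible class-based weighting, and a kernel defined as the inner product of semantically smoothed, weighted term vectors. All function names, the toy vocabulary, and the numbers are hypothetical.

```python
import math

def inverse_class_frequency(term_class_counts):
    """Assumed class-based weight: terms concentrated in few classes score higher.

    term_class_counts[t][c] is the number of occurrences of term t in class c.
    """
    n_classes = len(term_class_counts[0])
    weights = []
    for counts in term_class_counts:
        classes_with_term = sum(1 for c in counts if c > 0)
        if classes_with_term == 0:
            weights.append(0.0)
        else:
            weights.append(math.log(n_classes / classes_with_term) + 1.0)
    return weights

def feature_map(doc, weights, sem_sim):
    """Weight each term frequency, then spread it over semantically similar terms."""
    weighted = [tf * w for tf, w in zip(doc, weights)]
    n_terms = len(weighted)
    return [
        sum(weighted[i] * sem_sim[i][j] for i in range(n_terms))
        for j in range(n_terms)
    ]

def semantic_kernel(doc_a, doc_b, weights, sem_sim):
    """Inner product of the two mapped documents (a valid kernel by construction)."""
    mapped_a = feature_map(doc_a, weights, sem_sim)
    mapped_b = feature_map(doc_b, weights, sem_sim)
    return sum(a * b for a, b in zip(mapped_a, mapped_b))

# Toy vocabulary: ["car", "automobile", "banana"] (hypothetical).
sem_sim = [
    [1.0, 0.9, 0.0],  # "car" is close to "automobile"
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # "banana" is unrelated to both
]
# Occurrences of each term in two classes (hypothetical counts).
term_class_counts = [[5, 0], [4, 0], [0, 6]]
weights = inverse_class_frequency(term_class_counts)

doc_car = [1, 0, 0]
doc_automobile = [0, 1, 0]
doc_banana = [0, 0, 1]

k_synonyms = semantic_kernel(doc_car, doc_automobile, weights, sem_sim)
k_unrelated = semantic_kernel(doc_car, doc_banana, weights, sem_sim)
plain_dot = sum(a * b for a, b in zip(doc_car, doc_automobile))
```

In this toy setup the plain dot product of the "car" and "automobile" documents is zero because they share no terms, while the smoothed kernel gives them a positive similarity. A Gram matrix built this way could be passed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.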
