Improving Farsi multiclass text classification using a thesaurus and two‐stage feature selection

The rapid growth of available information has made automatic document classification a necessity. This article presents a system for multiclass categorization of Farsi documents that requires fewer training examples and can help compensate for the shortcomings of the standard training dataset. The proposed idea is to extend the feature vector with words extracted from a thesaurus and then filter the extended vector with a secondary feature selection step that discards inappropriate features. This secondary step chooses the most appropriate of the thesaurus-derived features, which strengthens the effect of the thesaurus on classifier efficiency. To evaluate the proposed system, a corpus was gathered from the Farsi Wikipedia website and from articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to the role of the thesaurus and of secondary feature selection, the effects of the number of categories, the size of the training dataset, and the average number of words in the test data are also examined. The results indicate that classification efficiency improves with this approach, especially when the available data are insufficient for some text categories. © 2011 Wiley Periodicals, Inc.
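
The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the scoring function, so a chi-square statistic is assumed here, and the toy English documents, the `thesaurus` dictionary, and all function names and parameters (`k_primary`, `k_added`) are hypothetical stand-ins. Primary selection picks the top-scoring terms, documents are expanded with thesaurus entries for those terms, and a secondary selection keeps only the added words that still discriminate between classes.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square value over all classes.

    ASSUMPTION: chi-square is an illustrative choice; the source does not
    say which feature-selection measure is used.
    """
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # term -> number of documents containing it
    df_in_class = Counter()  # (term, class) -> number of such documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term, total in df.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = df_in_class[(term, cls)]  # in class, contains term
            b = total - a                 # other classes, contains term
            c = size - a                  # in class, lacks term
            d = n - size - b              # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[term] = best
    return scores


def select_top(scores, candidates, k):
    """Return the k best candidates, breaking ties alphabetically."""
    return sorted(candidates, key=lambda t: (-scores.get(t, 0.0), t))[:k]


def two_stage_features(docs, labels, thesaurus, k_primary, k_added):
    # Stage 1: primary feature selection on the raw documents.
    primary = select_top(chi_square_scores(docs, labels),
                         {t for doc in docs for t in doc}, k_primary)
    primary_set = set(primary)

    # Expansion: add thesaurus entries for each primary feature in a document.
    expanded = [
        tokens + [rel for t in tokens if t in primary_set
                  for rel in thesaurus.get(t, ())]
        for tokens in docs
    ]
    added = {rel for t in primary for rel in thesaurus.get(t, ())} - primary_set

    # Stage 2: secondary selection restricted to the thesaurus-derived words.
    scores = chi_square_scores(expanded, labels)
    kept = [t for t in select_top(scores, added, k_added) if scores.get(t, 0.0) > 0]
    return primary + kept


# Hypothetical toy data (English tokens used only for readability).
docs = [
    ["football", "match", "goal"],
    ["football", "team", "win"],
    ["election", "vote", "party"],
    ["election", "minister", "vote"],
]
labels = ["sport", "sport", "politics", "politics"]
thesaurus = {"football": ["soccer", "game"], "election": ["poll", "game"]}

features = two_stage_features(docs, labels, thesaurus, k_primary=2, k_added=2)
print(features)
```

On this toy input the word "game" is related to features of both classes, so it carries no class information after expansion and the secondary selection discards it, while "soccer" and "poll" are kept.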
