Improving Text Classification Performance Using PCA and Recall-Precision Criteria

Persian text typically yields a large number of features, only some of which are useful for classification. This makes feature extraction one of the more difficult tasks in Persian text analysis and understanding. Few research works have addressed this problem; this paper introduces a novel approach for extracting the most relevant features from Persian text and classifying it. Experimental results show that applying principal component analysis (PCA) together with recall and precision criteria, and using term frequency and the category relevancy factor, considerably improves the running time of the classification process, while accuracy and precision either improve slightly or do not decrease enough to affect classification performance.
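As a rough illustration of the ingredients named above, the following sketch builds a term-frequency matrix for a few toy documents, reduces it with PCA, and computes precision and recall for one class. It is not the paper's implementation: the function names, the toy English tokens, and the use of power iteration to find the principal components are all assumptions made for this example, and the category relevancy factor is omitted because it is not defined here.

```python
import math
import random
from collections import Counter


def term_frequency_matrix(docs):
    """Return (vocabulary, rows), where each row holds relative term frequencies."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] / len(doc) for t in vocab])
    return vocab, rows


def center(rows):
    """Subtract the per-column mean, as PCA requires."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def principal_components(centered, k, iters=500, seed=0):
    """Top-k eigenvectors of the covariance matrix (power iteration + deflation)."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    components = []
    for _ in range(k):
        v = [rng.uniform(-1, 1) for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in any direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        components.append(v)
        # Deflate: remove the component just found before searching for the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return components


def project(centered, components):
    """Map each centered row onto the chosen principal components."""
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy corpus: two "sport" documents and two "economy" documents.
toy_docs = [
    ["sport", "team", "goal"],
    ["team", "goal", "goal"],
    ["market", "stock", "price"],
    ["stock", "price", "price"],
]
vocab, tf_rows = term_frequency_matrix(toy_docs)
reduced = project(center(tf_rows), principal_components(center(tf_rows), k=1))
```

Here six term-frequency features are reduced to a single component per document, so any classifier trained on `reduced` works with far fewer dimensions. That is the kind of running-time saving the abstract refers to, and `precision_recall` gives one way to check that quality has not suffered.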
