Feature annotation for text categorization

In text categorization, feature extraction is one of the main strategies for making text classifiers more efficient and accurate. Quickly selecting a suitable feature extraction strategy from the many proposed in previous studies is difficult. In this paper, we propose an efficient entity extraction approach to feature extraction that contributes to accurate text categorization. The entities identified in the proposed approach are person names, organization names, locations, and dates. We use the GATE tool to extract these entities. Once the entities are identified, we annotate each of them in the original text with parameters. Three measures are used for feature selection: term frequency (TF), information gain (IG), and chi-square (χ²). The effectiveness and accuracy of the entity-annotated features are judged by using them for classification and comparing the results against those obtained with non-annotated features. The experiments are performed on standard benchmark datasets such as the NFS Abstract datasets and Reuters-21578. The experimental results show that text categorization using the annotated features is more accurate than with non-annotated features on the NFS Abstract-Title dataset. For Reuters-21578, however, there was no significant improvement in classification accuracy.
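
The three feature selection measures named above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the toy documents, the class labels, and the `ORG_`/`LOC_` token prefixes standing in for entity annotations are all hypothetical, and the chi-square and information gain formulas are the standard two-class contingency-table forms, which the paper may apply differently.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def contingency(docs, labels, term, target):
    """Document counts: a = term and class, b = term and not class,
    c = no term and class, d = no term and not class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_class = label == target
        if present and in_class:
            a += 1
        elif present:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    return a, b, c, d


def term_frequency(docs, term):
    """Total number of occurrences of the term across all documents."""
    return sum(tokens.count(term) for tokens in docs)


def chi_square(a, b, c, d):
    """Chi-square statistic of term/class dependence for a 2x2 table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def information_gain(a, b, c, d):
    """Reduction in class entropy from knowing whether the term is present."""
    n = a + b + c + d
    prior = entropy([a + c, b + d])
    conditional = ((a + b) / n) * entropy([a, b]) + ((c + d) / n) * entropy([c, d])
    return prior - conditional


# Hypothetical toy corpus; "ORG_nsf" and "LOC_texas" mimic annotated entities.
docs = [
    ["ORG_nsf", "grant", "award"],
    ["ORG_nsf", "research", "grant"],
    ["oil", "price", "LOC_texas"],
    ["oil", "export", "grant"],
]
labels = ["science", "science", "trade", "trade"]

for term in ("ORG_nsf", "grant"):
    table = contingency(docs, labels, term, "science")
    print(term, term_frequency(docs, term), chi_square(*table), information_gain(*table))
```

In this toy corpus the annotated token `ORG_nsf` occurs only in the "science" documents, so it scores higher on both chi-square and information gain than the plain word `grant`, which also appears in a "trade" document.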
