A comprehensive analysis of using semantic information in text categorization

Traditional text categorization methods deal only with the content of the documents and use statistics-based metrics to represent them. A machine learning approach then uses this representation to determine the document class. In this picture, the meaning of the document is missing. To add meaning to the text categorization process, we start with part-of-speech (POS) tagging. As expected, the POS tags in a document do not all contribute the same amount of information to its meaning. In addition to the POS information, we use WordNet to add semantic features such as synonyms, hypernyms, hyponyms, meronyms and topics to the classification process. Using WordNet's semantic features introduces ambiguity, and not all semantic features are actually related to the document content. To overcome this problem, we introduce a new method to eliminate the ambiguity. We apply various combinations of POS, WordNet and word sense disambiguation, and the results show that using semantic features performs better than the traditional, context-based methods.
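The pipeline described above (POS-based weighting, expansion with WordNet relations, and disambiguation before expansion) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `TOY_SENSES` is a hypothetical hand-written sense inventory standing in for WordNet, the `POS_WEIGHTS` values are arbitrary, and the disambiguation step is a simplified Lesk-style gloss overlap, not the new method this abstract refers to.

```python
from collections import Counter

# Hypothetical toy sense inventory standing in for WordNet (not real WordNet data).
# Each word maps to a list of senses with gloss words, synonyms, and hypernyms.
TOY_SENSES = {
    "bank": [
        {"gloss": {"money", "deposit", "loan", "financial"},
         "synonyms": ["depository"], "hypernyms": ["institution"]},
        {"gloss": {"river", "water", "slope", "land"},
         "synonyms": ["riverside"], "hypernyms": ["slope"]},
    ],
    "loan": [
        {"gloss": {"money", "lend", "interest"},
         "synonyms": ["credit"], "hypernyms": ["debt"]},
    ],
}

# Assumed POS weights, chosen only for illustration; unlisted tags get weight 0.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.5, "ADJ": 0.25}


def pick_sense(word, context):
    """Simplified Lesk-style choice: the sense whose gloss overlaps the context most.

    Ties (including zero overlap) fall back to the first listed sense.
    """
    senses = TOY_SENSES.get(word, [])
    if not senses:
        return None
    return max(senses, key=lambda s: len(s["gloss"] & context))


def semantic_features(tagged_tokens, expansion_weight=0.5):
    """Build a weighted bag of features from (word, pos) pairs."""
    context = {w for w, _ in tagged_tokens}
    features = Counter()
    for word, pos in tagged_tokens:
        weight = POS_WEIGHTS.get(pos, 0.0)
        if weight == 0.0:
            continue  # skip tags assumed to carry little meaning
        features[word] += weight
        # Disambiguate first, then expand only with the chosen sense's relations.
        sense = pick_sense(word, context - {word})
        if sense is not None:
            for related in sense["synonyms"] + sense["hypernyms"]:
                features[related] += weight * expansion_weight
    return features


doc = [("the", "DET"), ("bank", "NOUN"), ("approved", "VERB"),
       ("a", "DET"), ("loan", "NOUN"), ("deposit", "NOUN")]
feats = semantic_features(doc)
```

In this toy document, "bank" is expanded with the features of its financial sense because "loan" and "deposit" appear in the context, while the river sense contributes nothing. The resulting weighted features could then be passed to any standard classifier in place of plain term counts.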
