Experimenting N-Grams in Text Categorization

This paper deals with automatic supervised classification of documents. The suggested approach is based on a vector representation of documents built not on words but on character n-grams, for varying n. The effects of this method are examined in several experiments that use the multivariate chi-square to reduce dimensionality, the cosine and Kullback-Leibler distances, and two benchmark corpora for evaluation: the Reuters-21578 newswire articles and the 20 Newsgroups data. Evaluation uses the macro-averaged F1 measure. The results show the effectiveness of this approach compared to the bag-of-words and stem representations.
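As a rough illustration of the ingredients named in the abstract, the following Python sketch builds character n-gram count vectors, compares them with the cosine and Kullback-Leibler measures, and computes a macro-averaged F1. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the raw-count weighting, the choice of n in {2, 3, 4}, the additive smoothing constant, and the nearest-profile decision rule are all hypothetical choices made for the example, and the chi-square dimensionality reduction step is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n_values=(2, 3, 4)):
    """Count character n-grams for each n in n_values (assumed range)."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine_distance(a, b):
    """1 - cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kl_divergence(p_counts, q_counts, epsilon=1e-6):
    """KL(P || Q) over the joint vocabulary.

    Additive smoothing with epsilon is an assumption of this sketch;
    it keeps the logarithm finite for n-grams missing from one side.
    """
    vocab = p_counts.keys() | q_counts.keys()
    p_total = sum(p_counts.values()) + epsilon * len(vocab)
    q_total = sum(q_counts.values()) + epsilon * len(vocab)
    kl = 0.0
    for g in vocab:
        p = (p_counts[g] + epsilon) / p_total
        q = (q_counts[g] + epsilon) / q_total
        kl += p * math.log(p / q)
    return kl


def classify(doc, class_profiles, distance=cosine_distance):
    """Assign the label whose profile is closest (hypothetical rule)."""
    doc_counts = char_ngrams(doc)
    return min(class_profiles,
               key=lambda label: distance(doc_counts, class_profiles[label]))


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 scores."""
    labels = set(true_labels) | set(predicted_labels)
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy class profiles built from made-up training text.
profiles = {
    "grain": char_ngrams("wheat corn grain harvest wheat exports"),
    "money": char_ngrams("interest rates bank currency dollar rates"),
}
predicted = classify("wheat harvest", profiles)
```

Because n-grams are taken over the whole character stream, inflected forms such as "harvest" and "harvests" share most of their features without any stemming step, which is the kind of robustness the comparison with bag-of-words and stem representations is about. Note that the Kullback-Leibler measure is asymmetric, so the argument order in `kl_divergence` matters.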
