Statistical Phrases in Automated Text Categorization

In this work we investigate the usefulness of {\em $n$-grams} for document indexing in text categorization (TC). We define an $n$-gram as a set $t_k$ of $n$ word stems, and we say that $t_k$ occurs in a document $d_j$ when a sequence of words appears in $d_j$ that, after stop word removal and stemming, consists exactly of the $n$ stems in $t_k$, in some order. Previous studies have investigated the use of $n$-grams (or some variant of them) only in the context of specific learning algorithms, and thus have not yielded general answers about their usefulness for TC. Here we assess the usefulness of $n$-grams in TC independently of any specific learning algorithm. We do so by applying feature selection to the pool of all $\alpha$-grams ($\alpha\leq n$), and checking how many $n$-grams score high enough to be selected among the top $\sigma$ $\alpha$-grams. We report the results of experiments performed on the {\sf Reuters-21578} standard TC benchmark, using several feature selection functions and varying values of $\sigma$. We also report the results of actually using the selected $n$-grams in a linear classifier induced by means of the Rocchio method.
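
The procedure described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the stop list, the suffix-stripping stemmer, the choice of $\chi^2$ as the feature selection function, the treatment of windows with repeated stems, and all function names are assumptions made for illustration only. The sketch builds the pool of all $\alpha$-grams ($\alpha\leq n$) as unordered sets of stems, ranks them, and counts how many grams of each size fall in the top $\sigma$.

```python
from collections import Counter

# Toy stop list (assumption; not the list used in the paper).
STOP_WORDS = {"the", "of", "a", "in", "and", "to", "is"}

def stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer (assumption).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, remove stop words, then stem the remaining words.
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def alpha_grams(stems, n):
    """Return all alpha-grams (alpha <= n) occurring in a stem sequence.

    Each gram is a frozenset, so the order of the stems is ignored.
    Windows containing a repeated stem are skipped (assumption).
    """
    grams = set()
    for alpha in range(1, n + 1):
        for i in range(len(stems) - alpha + 1):
            gram = frozenset(stems[i : i + alpha])
            if len(gram) == alpha:
                grams.add(gram)
    return grams

def chi_square(gram, doc_grams, labels):
    # Chi-square statistic of a gram with respect to one binary category.
    n_docs = len(doc_grams)
    a = sum(1 for g, y in zip(doc_grams, labels) if gram in g and y)
    b = sum(1 for g, y in zip(doc_grams, labels) if gram in g and not y)
    c = sum(1 for g, y in zip(doc_grams, labels) if gram not in g and y)
    d = n_docs - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n_docs * (a * d - c * b) ** 2 / denom if denom else 0.0

def count_selected_ngrams(docs, labels, n, sigma):
    """Count, per gram size, how many grams rank in the top sigma."""
    doc_grams = [alpha_grams(preprocess(d), n) for d in docs]
    pool = set().union(*doc_grams)
    # Sort by decreasing score; ties are broken alphabetically.
    ranked = sorted(
        pool, key=lambda g: (-chi_square(g, doc_grams, labels), sorted(g))
    )
    return Counter(len(g) for g in ranked[:sigma])

docs = [
    "oil price rises",
    "price of oil falls",
    "wheat harvest grows",
    "harvest of wheat",
]
labels = [True, True, False, False]  # membership in a hypothetical category
selected = count_selected_ngrams(docs, labels, n=2, sigma=6)
print(selected[1], "unigrams and", selected[2], "bigrams in the top 6")
```

In this toy collection, "oil price" and "price of oil" both yield the same bigram after stop word removal, which is the effect of treating an $n$-gram as a set rather than a sequence.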
