Filtering Contents with Bigrams and Named Entities to Improve Text Classification

We present a new method for classifying “noisy” documents, based on filtering their content with bigrams and named entities. We apply the method to call-for-tender documents, but we argue that it would be useful for many other Web collections that also contain non-topical content. We discuss several variations of the method. The best results come from filtering out a window of text around the least relevant bigrams: this yields a significant increase in the micro-F1 measure on our collection of calls for tenders, as well as on the “4-Universities” collection. A second approach, which rejects sentences based on the presence of certain named entities, yields a moderate increase. Finally, we tried combining the two approaches, but the results so far are inconclusive.
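The window-based filtering described in the abstract can be illustrated with a minimal sketch. The abstract does not say how bigram relevance is scored or how wide the window is, so everything below is an assumption made for illustration: the function name `filter_around_bigrams`, the caller-supplied `scores` dictionary, the `threshold` and `window` parameters, and the choice to keep bigrams that have no score.

```python
from typing import Dict, List, Tuple

Bigram = Tuple[str, str]


def filter_around_bigrams(
    tokens: List[str],
    scores: Dict[Bigram, float],
    threshold: float,
    window: int = 2,
) -> List[str]:
    """Drop every token within `window` positions of a low-relevance bigram.

    `scores` maps a bigram to a relevance score (hypothetical; how the score
    is computed is left open here). Bigrams missing from `scores` are treated
    as relevant, which is an assumption of this sketch.
    """
    drop = [False] * len(tokens)
    for i in range(len(tokens) - 1):
        bigram = (tokens[i], tokens[i + 1])
        # Unscored bigrams default to the threshold, so they are never dropped.
        if scores.get(bigram, threshold) < threshold:
            start = max(0, i - window)
            end = min(len(tokens), i + 2 + window)  # bigram spans i and i + 1
            for j in range(start, end):
                drop[j] = True
    return [tok for tok, dropped in zip(tokens, drop) if not dropped]


# Illustrative data only; the scores are invented for the example.
text = "please submit your bid before friday for road paving services in quebec"
example_scores = {("submit", "your"): 0.1, ("road", "paving"): 0.9}
kept = filter_around_bigrams(text.split(), example_scores, threshold=0.5, window=1)
print(" ".join(kept))  # before friday for road paving services in quebec
```

In this sketch the filtered token list, rather than the full document, would be passed to the classifier. Marking positions in a boolean mask before deleting keeps overlapping windows from shifting indices.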
