A Case Study in Using Linguistic Phrases for Text Categorization on the WWW

Most learning algorithms applied to text categorization problems rely on a bag-of-words document representation, i.e., each word occurring in the document is treated as a separate feature. In this paper, we investigate the use of linguistic phrases as input features for text categorization problems. These features are based on information extraction patterns that are generated and used by the AUTOSLOG-TS system. We present experimental results on using such features as background knowledge for two machine learning algorithms on a classification task on the WWW. The results show that phrasal features can improve the precision of learned theories at the expense of coverage.
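
To make the contrast between the two representations concrete, here is a minimal sketch in Python. The bag-of-words function follows the definition given in the abstract. The phrase function is only a hypothetical stand-in: it uses adjacent word pairs within a sentence, whereas AUTOSLOG-TS derives its features from syntactic extraction patterns, which this sketch does not attempt to reproduce. The function names and the example document are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercase word to its count; every word is its own feature."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def word_pair_phrases(text):
    """Hypothetical phrasal features: adjacent word pairs within a sentence.

    This is NOT the AUTOSLOG-TS method; it only illustrates that a phrase
    feature keeps local word order, which a bag of words discards.
    """
    features = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        # Pair each word with its successor; pairs never cross sentences.
        features.update(zip(words, words[1:]))
    return features


# Illustrative document (an assumption, not taken from the paper's data).
doc = "I am a student. The student teaches a course."
words = bag_of_words(doc)
phrases = word_pair_phrases(doc)
```

In this sketch the word feature `student` fires twice without distinguishing the two contexts, while the pair features `("a", "student")` and `("student", "teaches")` each fire once. More specific features of this kind match fewer documents, which is consistent with the reported trade-off of higher precision at the expense of coverage.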
