Linguistic Techniques to Improve the Performance of Automatic Text Categorization

This paper presents a method for incorporating natural language processing into existing text categorization procedures. Three aspects are investigated: (i) a method for weighting terms based on the concept of a probability-weighted amount of information, (ii) estimation of term occurrence probabilities using a probabilistic language model, and (iii) automatic extraction of terms based on part-of-speech (POS) tags generated by a morphological analyzer. The effects of these techniques are examined in experiments using the Reuters-21578 and NTCIR-J1 standard test collections.
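
As an illustration of aspect (i), the following is a minimal sketch of one plausible reading of a probability-weighted amount of information. The abstract does not give the formula, so the sketch assumes that the weight of term t in document d is P(t,d) log(P(t,d) / (P(t) P(d))), with all probabilities estimated by simple relative frequencies. The function name `pwi_weights` and the toy documents are hypothetical and are not taken from the paper; the language-model estimation of aspect (ii) is not modeled here.

```python
import math
from collections import Counter


def pwi_weights(docs):
    """Return one {term: weight} dict per document.

    Assumed weight (not confirmed by the abstract):
        w(t, d) = P(t, d) * log(P(t, d) / (P(t) * P(d)))
    where probabilities are relative frequencies over all tokens.
    """
    total = sum(len(d) for d in docs)  # total number of tokens
    if total == 0:
        return [{} for _ in docs]
    term_counts = Counter(t for d in docs for t in d)
    weights = []
    for d in docs:
        p_d = len(d) / total  # P(d): share of tokens in this document
        w = {}
        for t, c in Counter(d).items():
            p_td = c / total               # P(t, d): joint probability
            p_t = term_counts[t] / total   # P(t): marginal term probability
            w[t] = p_td * math.log(p_td / (p_t * p_d))
        weights.append(w)
    return weights


# Hypothetical toy collection of two tokenized documents.
docs = [["oil", "price", "oil"], ["price", "bank"]]
weights = pwi_weights(docs)
```

Under this assumption, a term concentrated in one document ("oil" in the first document) receives a higher weight than a term spread evenly across the collection ("price"), and the weights summed over all term-document pairs equal the mutual information between terms and documents.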