A High Performance Two-Class Chinese Text Categorization Method
Filtering topic-sensitive information is an important application of text categorization, and doing so effectively for Chinese text collections is a technical challenge. This paper presents a high-performance method that classifies texts with a two-step strategy. In the first step, we treat words whose part of speech is verb, noun, adjective, or adverb as candidate features, perform feature selection on them using an improved mutual information formula, and classify the input texts with a naive Bayes classifier. Texts whose classification is judged unreliable at this stage are identified; they form a fuzzy area between the two categories. In the second step, we treat bigrams of words whose part of speech is verb or noun as candidate features and apply the same feature selection and classifier to the texts in the fuzzy area. Experiments on a test set of 12,600 Chinese texts show that the method achieves high performance: precision, recall, and F1 are 97.19%, 93.94%, and 95.54%, respectively.
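The two-step strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the inputs are taken to be already segmented and filtered by part of speech, feature selection is omitted because the abstract does not give the improved mutual information formula, and the fuzzy area is approximated as the set of texts whose first-step log-odds fall within a hypothetical `margin` of zero. The names `NaiveBayes`, `bigrams`, `two_step_classify`, and the toy data are likewise hypothetical.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing for classes 0 and 1."""

    def fit(self, docs, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = Counter(labels)
        for feats, y in zip(docs, labels):
            self.counts[y].update(feats)
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        return self

    def log_odds(self, feats):
        """Return log P(1 | feats) - log P(0 | feats); the sign gives the class."""
        score = math.log(self.n_docs[1] / self.n_docs[0])
        v = len(self.vocab)
        for f in feats:
            if f not in self.vocab:
                continue  # features unseen in training are ignored
            p1 = (self.counts[1][f] + 1) / (self.totals[1] + v)
            p0 = (self.counts[0][f] + 1) / (self.totals[0] + v)
            score += math.log(p1 / p0)
        return score


def bigrams(tokens):
    """Join adjacent tokens into bigram features."""
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]


def two_step_classify(tokens, step1, step2, margin=1.0):
    """Classify with word features; fall back to bigrams inside the fuzzy area.

    `margin` is an assumed threshold on the absolute log-odds, not a value
    taken from the paper.
    """
    s1 = step1.log_odds(tokens)
    if abs(s1) >= margin:
        return int(s1 > 0), "step1"
    s2 = step2.log_odds(bigrams(tokens))
    return int(s2 > 0), "step2"


# Toy data (hypothetical): label 1 = topic-sensitive, label 0 = other.
train_docs = [
    ["attack", "city"], ["attack", "plan"], ["not", "good"],
    ["game", "city"], ["game", "plan"], ["good", "not"],
]
train_labels = [1, 1, 1, 0, 0, 0]

step1 = NaiveBayes().fit(train_docs, train_labels)
step2 = NaiveBayes().fit([bigrams(d) for d in train_docs], train_labels)

clear_case = two_step_classify(["attack", "city"], step1, step2)
fuzzy_case = two_step_classify(["not", "good"], step1, step2)
```

In this toy data, the words "not" and "good" occur equally often in both classes, so the first step cannot separate them and the text is passed to the bigram classifier, where word order becomes a feature. A text containing "attack" is decided in the first step.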