Paper Information - Macro Features Based Text Categorization

Macro Features Based Text Categorization

Text Categorization (TC) is one of the key techniques in web information processing. Many approaches to TC have been proposed. Most of them represent text using the distributions and relationships of terms, and few take document-level relationships into account. In this paper, document-level distributions and relationships are used as a novel type of feature for TC. We call them macro features to differentiate them from term-based features. Two methods are proposed for macro feature extraction. The first is a semi-supervised method based on document clustering. The second constructs the macro feature vector of a text using the centroid of each text category. Experiments conducted on the standard corpora Reuters-21578 and 20-newsgroup show that the proposed methods can bring substantial performance improvement by simply combining macro features with classical term-based features.
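
The abstract does not give the exact formulas, so the following is only a minimal sketch of the second (centroid-based) idea under stated assumptions: each category centroid is taken to be the mean of that category's training term vectors, and a document's macro feature vector is assumed to be its cosine similarity to each centroid. The function names (`category_centroids`, `macro_features`, `combine`) and the toy data are hypothetical and are not taken from the paper.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length term vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def category_centroids(train_vectors, train_labels):
    """Map each category label to the centroid of its training vectors."""
    by_label = {}
    for vec, label in zip(train_vectors, train_labels):
        by_label.setdefault(label, []).append(vec)
    # Sort by label so the macro feature order is deterministic.
    return {label: centroid(vecs) for label, vecs in sorted(by_label.items())}

def macro_features(vec, centroids):
    """Assumed macro features: similarity of the document to each centroid."""
    return [cosine(vec, c) for c in centroids.values()]

def combine(vec, centroids):
    """Concatenate term-based features with the macro features."""
    return list(vec) + macro_features(vec, centroids)

# Toy term-frequency vectors over a three-term vocabulary (illustrative only).
train_vectors = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 0]]
train_labels = ["sports", "sports", "trade", "trade"]

centroids = category_centroids(train_vectors, train_labels)
doc = [1, 0, 1]
combined = combine(doc, centroids)
print(combined)
```

The combined vector keeps the original term features and appends one value per category, so it could be passed to any standard classifier in place of the term-only representation. Other similarity measures or centroid definitions would fit the same structure.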
