Paper information - Rule-based word clustering for text classification

Rule-based word clustering for text classification

This paper introduces a rule-based, context-dependent word clustering method, with the rules derived from various domain databases and the orthographic properties of the words. In addition to a significant dimensionality reduction, our experiments show that such rule-based word clustering improves the overall accuracy of extracting bibliographic fields from references by 8%, and the class-specific performance of line classification on document headers by 18.32% on average.
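As a rough illustration of the idea described in the abstract, the sketch below maps each token to a cluster label using a small lookup set and a few orthographic patterns, with one rule that depends on neighboring tokens. The rule set, the `MONTHS` lookup, the label names such as `:year:`, and the functions `cluster_token` and `cluster_line` are all hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import re

# Hypothetical stand-in for one of the "domain databases" mentioned in the abstract.
MONTHS = {"jan", "january", "feb", "february", "mar", "march", "apr", "april",
          "may", "jun", "june", "jul", "july", "aug", "august", "sep",
          "september", "oct", "october", "nov", "november", "dec", "december"}

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def cluster_token(tokens, i):
    """Map tokens[i] to a cluster label, or to its lowercase form if no rule fires."""
    strip_chars = ",;()"
    tok = tokens[i].strip(strip_chars)
    prev_tok = tokens[i - 1].strip(strip_chars) if i > 0 else ""
    next_tok = tokens[i + 1].strip(strip_chars) if i + 1 < len(tokens) else ""

    # Orthographic rules: decided from the token's own characters.
    if EMAIL_RE.match(tok):
        return ":email:"
    if re.fullmatch(r"(19|20)\d{2}", tok):
        return ":year:"
    if tok.isdigit():
        return ":digits:"

    # Context-dependent database rule: a month name only counts as a month
    # when a number sits next to it ("May 2003" versus "this may help").
    if tok.rstrip(".").lower() in MONTHS and (prev_tok.isdigit() or next_tok.isdigit()):
        return ":month:"

    if re.fullmatch(r"[A-Z]\.", tok):
        return ":initial:"
    if re.fullmatch(r"[A-Z][a-z]+", tok):
        return ":capword:"
    return tok.lower()


def cluster_line(line):
    """Replace every whitespace-separated token of a line by its cluster label."""
    tokens = line.split()
    return [cluster_token(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    print(cluster_line("H. Han, May 2003, pp. 12"))
    print(cluster_line("this may help"))
```

Because many distinct surface forms (every year, every capitalized name, every e-mail address) collapse onto a handful of labels, the feature space handed to a classifier shrinks, which is the dimensionality reduction the abstract refers to.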

Hui Han | Hongyuan Zha | C. Lee Giles | Eren Manavoglu

[1] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[2] Inderjit S. Dhillon, et al. A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification, 2003, J. Mach. Learn. Res.

[3] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[4] Roni Rosenfeld, et al. Learning Hidden Markov Model Structure for Information Extraction, 1999.

[5] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[6] Andrew McCallum, et al. Distributional clustering of words for text classification, 1998, SIGIR '98.