Rule-based word clustering for text classification

This paper introduces a rule-based, context-dependent word clustering method, with the rules derived from various domain databases and from the orthographic properties of the words. Our experiments show that, besides significantly reducing dimensionality, such rule-based word clustering improves the overall accuracy of extracting bibliographic fields from references by 8%, and the class-specific performance on the line classification of document headers by 18.32% on average.
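
To make the idea concrete, the following is a minimal sketch of mapping words to cluster labels with ordered rules. It is an illustration under stated assumptions and not the rule set used in the paper: the word lists `FIRST_NAMES` and `MONTHS` are tiny hypothetical stand-ins for domain databases, the label names (`:month:`, `:year:`, and so on) and the functions `cluster_word` and `cluster_line` are invented here, and the context-dependent part of the method is omitted for brevity.

```python
import re

# Hypothetical stand-ins for domain databases (illustrative only).
FIRST_NAMES = {"john", "mary", "wei"}
MONTHS = {"january", "jan", "february", "feb", "march", "mar"}


def cluster_word(word: str) -> str:
    """Map one word to a cluster label; the first matching rule wins."""
    # Drop surrounding punctuation before applying the rules.
    token = word.strip(".,;:()[]\"'")
    if not token:
        return word  # pure punctuation is kept unchanged
    lower = token.lower()

    # Rules backed by (hypothetical) database lookups.
    if lower in MONTHS:
        return ":month:"
    if lower in FIRST_NAMES:
        return ":name:"

    # Rules based on orthographic properties of the word.
    if re.fullmatch(r"(19|20)\d{2}", token):
        return ":year:"
    if re.fullmatch(r"\d+", token):
        return ":digits:"
    if re.fullmatch(r"[A-Z]", token):
        return ":single_cap:"
    if token[0].isupper():
        return ":cap_word:"

    # No rule applies: fall back to the lowercased word itself.
    return lower


def cluster_line(line: str) -> list[str]:
    """Replace every whitespace-separated word in a line by its cluster label."""
    return [cluster_word(w) for w in line.split()]


if __name__ == "__main__":
    print(cluster_line("J. Smith, Mar 1999, pp. 12"))
```

Because many distinct surface forms (every four-digit year, every capitalized surname) collapse into a single label, the feature vocabulary seen by a downstream classifier shrinks, which is the dimensionality reduction effect referred to above.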