A Hierarchical Classification Model for Document Categorization

We propose a novel hierarchical classification method for document categorization. The approach applies a separate classification stage at each level of the category hierarchy. Regularized Least Squares (RLS) binary classifiers are applied at the middle levels of the hierarchy to narrow each document down to a smaller set of candidate categories, and K-nearest-neighbor (KNN) multi-class classifiers are used at the bottom level to assign the document to its final class. Experiments on large-scale, real-world tax documents show that the proposed hierarchical approach outperforms a traditional flat classification method.
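The two-stage routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the kernel, the depth of the hierarchy, or how the binary decisions are combined, so the sketch assumes a two-level hierarchy, linear RLS trained one-vs-rest per group, routing by highest RLS score, and KNN with squared Euclidean distance. The group names ("A", "B"), class labels, toy vectors, and all function names are hypothetical.

```python
from collections import Counter

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def rls_fit(X, y, lam=0.1):
    """Linear RLS: w = (X^T X + lam I)^-1 X^T y, with targets y in {-1, +1}."""
    d = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * t for x, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def rls_score(w, x):
    """Real-valued RLS output; larger means more confident membership."""
    return sum(wi * xi for wi, xi in zip(w, x))

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest (vector, label) pairs."""
    nearest = sorted(
        train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def train_groups(docs, lam=0.1):
    """Middle level (assumed one-vs-rest): one RLS binary classifier per group."""
    X = [v + [1.0] for v, _, _ in docs]  # append a bias feature
    groups = sorted({g for _, g, _ in docs})
    return {g: rls_fit(X, [1.0 if gg == g else -1.0 for _, gg, _ in docs], lam)
            for g in groups}

def classify(x, docs, weights, k=3):
    """Route x to the best-scoring group, then run KNN inside that group only."""
    xb = x + [1.0]
    group = max(weights, key=lambda g: rls_score(weights[g], xb))
    members = [(v, label) for v, g, label in docs if g == group]
    return group, knn_predict(members, x, k)

# Hypothetical toy corpus: (feature vector, group, final class).
DOCS = [
    ([0.0, 0.0], "A", "A1"), ([0.2, 0.1], "A", "A1"), ([0.1, 0.3], "A", "A1"),
    ([0.0, 2.0], "A", "A2"), ([0.2, 2.1], "A", "A2"), ([0.1, 1.9], "A", "A2"),
    ([4.0, 0.0], "B", "B1"), ([4.2, 0.2], "B", "B1"), ([4.1, 0.1], "B", "B1"),
    ([4.0, 2.0], "B", "B2"), ([4.1, 2.2], "B", "B2"), ([3.9, 1.9], "B", "B2"),
]
WEIGHTS = train_groups(DOCS)

print(classify([0.1, 0.2], DOCS, WEIGHTS))
print(classify([4.0, 2.1], DOCS, WEIGHTS))
```

In this sketch the KNN stage only compares a document against training examples from the group selected by the RLS stage, which is the sense in which the middle level narrows the candidate set before the final decision. A flat baseline would instead run a single multi-class classifier over all final classes at once.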
