Paper information - Hierarchical document classification using automatically generated hierarchy

Hierarchical document classification using automatically generated hierarchy

This paper describes the development of an automated industry and occupation coding system for Korean Census records. The purpose of the system is to convert natural language responses on survey questionnaires into the corresponding numeric codes defined in the standard code book from the Census Bureau. We employ a kNN (k-nearest-neighbors) document classification method, together with information retrieval techniques for indexing and for weighting index terms. To address the inconsistent ways in which respondents describe the same industry or occupation, we use nouns and phrases acquired from past census data. From this data, we can estimate which nouns or phrases are frequently used to describe a given code. The experimental results show that past census data play an important role in increasing code classification accuracy.
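The kNN coding approach described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the term-weighting scheme, the similarity measure, or the value of k, so the TF-IDF weights, cosine similarity, and similarity-weighted voting below are assumptions, and the training responses and code labels (`FARM`, `SOFT`, `TEACH`) are hypothetical rather than real census codes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical past responses (already tokenized) and their assigned codes.
TRAIN_DOCS = [
    ["rice", "farming", "farm"],
    ["vegetable", "farming"],
    ["software", "development", "company"],
    ["software", "engineer"],
    ["elementary", "school", "teacher"],
]
TRAIN_CODES = ["FARM", "FARM", "SOFT", "SOFT", "TEACH"]


def tfidf_vectors(docs):
    """Build one {term: weight} vector per document, plus the idf table."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumed choice; the paper's weighting is not given).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query_tokens, train_docs, train_codes, k=3):
    """Return the code with the largest summed similarity among the k nearest
    training responses, or None if the query shares no term with any of them."""
    vectors, idf = tfidf_vectors(train_docs)
    # Terms never seen in the training data carry no weight.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scored = sorted(
        ((cosine(query, vec), code) for vec, code in zip(vectors, train_codes)),
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, code in scored[:k]:
        if sim > 0.0:
            votes[code] += sim
    return max(votes, key=votes.get) if votes else None


if __name__ == "__main__":
    print(knn_classify(["rice", "farm"], TRAIN_DOCS, TRAIN_CODES))
    print(knn_classify(["software", "company"], TRAIN_DOCS, TRAIN_CODES))
```

In this sketch the labeled past responses play the role the abstract assigns to past census data: they supply the vocabulary that respondents actually use for each code, so a new response is matched against real phrasing rather than only against code book definitions.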

Hyeoncheol Kim | Heuiseok Lim | Shenghuo Zhu | M. Ogihara | Tao Li
