Web Document Classification Based on Hangeul Morpheme and Keyword Analyses

With the development of high-speed Internet and large-scale database technology, the number of web documents is growing rapidly, so classifying those documents automatically is becoming increasingly important. In this study, we propose an effective method that extracts document features based on Hangeul morpheme and keyword analyses and classifies unstructured documents automatically by predicting their subjects. To extract document features, we first select terms using a morpheme analyzer, then form a keyword set based on term frequency and subject-discriminating power, and finally score each keyword using its discriminating power. We then generate classification models using commercial software that implements the decision tree, neural network, and SVM (support vector machine). Experimental results show that the proposed feature extraction method achieves considerable performance in classifying web documents by subject, with an average precision of 0.90 and recall of 0.84 in the case of the decision tree.
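The feature extraction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that documents have already been split into terms by a morpheme analyzer, and it assumes one possible definition of subject-discriminating power (the share of a term's occurrences that fall in its most frequent subject), since the abstract does not give a formula. The names `build_keyword_scores` and `feature_vector`, the thresholds `min_tf` and `min_power`, and the sample documents are all hypothetical.

```python
from collections import Counter, defaultdict


def build_keyword_scores(docs, min_tf=2, min_power=0.6):
    """Select keywords and score them.

    docs: list of (subject, tokens) pairs; tokens are assumed to come
    from a morpheme analyzer that has already been run.
    Returns {term: (subject, power)}, where power is an ASSUMED measure:
    the fraction of the term's occurrences found in its top subject.
    """
    tf_by_subject = defaultdict(Counter)
    for subject, tokens in docs:
        tf_by_subject[subject].update(tokens)

    # Term frequency over the whole training set.
    total_tf = Counter()
    for counts in tf_by_subject.values():
        total_tf.update(counts)

    keywords = {}
    for term, tf in total_tf.items():
        if tf < min_tf:  # drop rare terms
            continue
        best_subject, best_tf = max(
            ((s, c[term]) for s, c in tf_by_subject.items()),
            key=lambda pair: pair[1],
        )
        power = best_tf / tf
        if power >= min_power:  # keep only subject-discriminating terms
            keywords[term] = (best_subject, power)
    return keywords


def feature_vector(tokens, keywords, subjects):
    """Sum keyword scores per subject to form one feature per subject."""
    scores = dict.fromkeys(subjects, 0.0)
    for token in tokens:
        if token in keywords:
            subject, power = keywords[token]
            scores[subject] += power
    return [scores[s] for s in subjects]


# Hypothetical, already-tokenized training documents.
docs = [
    ("sports", ["축구", "경기", "선수", "축구"]),
    ("sports", ["야구", "경기", "선수"]),
    ("economy", ["주식", "시장", "금리", "주식"]),
    ("economy", ["금리", "시장", "경기"]),
]
subjects = ["sports", "economy"]
keywords = build_keyword_scores(docs)
vec = feature_vector(["축구", "선수", "금리"], keywords, subjects)
print(vec)
```

Vectors of this kind could then be passed to a decision tree, neural network, or SVM; the abstract states only that commercial software was used for that step, so the classifier itself is left out of the sketch.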

Won-Sik Choi | Seok-Lyong Lee | Dan-Ho Park | Hong-Jo Kim
