Keyword Extraction Using Support Vector Machine

This paper is concerned with keyword extraction. By keyword extraction, we mean extracting a subset of words or phrases from a document that can describe the 'meaning' of the document. Keywords benefit many text mining applications. However, a large number of documents do not have keywords, so keywords must be assigned before those benefits can be realized. Several research efforts have addressed keyword extraction. These methods rely on 'global context information' alone, which limits extraction performance. A thorough and systematic investigation of the issue is thus needed. In this paper, we propose to use not only 'global context information' but also 'local context information' for extracting keywords from documents. As far as we know, the combined use of 'global context information' and 'local context information' in keyword extraction has not been sufficiently investigated. We also propose methods for performing the task on the basis of Support Vector Machines (SVMs) and define the features used in the model. Experimental results indicate that the proposed SVM-based method significantly outperforms the baseline methods for keyword extraction. The proposed method has also been applied to document classification, a typical text mining task. Experimental results show that the accuracy of document classification can be significantly improved by using the keyword extraction method.
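To make the idea concrete, the following is a minimal sketch of keyword extraction as binary classification with a linear SVM over both kinds of features. Everything specific in it is an assumption made for illustration, since the abstract does not list the paper's features or training setup: the three features (relative frequency and earliness of first occurrence as 'global' features, and the share of occurrences with a non-stopword neighbour as a 'local' feature), the stopword list, the toy document and its labels, and the plain subgradient-descent trainer are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "for", "we", "from", "that"}


def features(word: str, tokens: Sequence[str]) -> List[float]:
    """Return [tf, earliness, local] for a word that occurs in tokens.

    These features are assumptions, not the features defined in the paper.
    """
    n = len(tokens)
    positions = [i for i, t in enumerate(tokens) if t == word]

    def is_content(i: int) -> bool:
        return 0 <= i < n and tokens[i] not in STOPWORDS

    tf = len(positions) / n              # global: relative frequency
    earliness = 1.0 - positions[0] / n   # global: 1.0 if the word opens the text
    # local: share of occurrences with a non-stopword immediate neighbour
    hits = sum(1 for i in positions if is_content(i - 1) or is_content(i + 1))
    return [tf, earliness, hits / len(positions)]


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed decision value of the linear model; positive means 'keyword'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    epochs: int = 500,
    lr: float = 0.1,
    lam: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    Labels in y must be +1 (keyword) or -1 (not a keyword).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * score(w, b, x)
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularization shrink
            if margin < 1.0:                         # hinge loss is active
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


if __name__ == "__main__":
    # Toy document and made-up gold keywords, for illustration only.
    tokens = ("keyword extraction selects words that describe the document "
              "and keyword extraction helps text mining").split()
    gold = {"keyword", "extraction"}
    candidates = sorted({t for t in tokens if t not in STOPWORDS})
    X = [features(c, tokens) for c in candidates]
    y = [1 if c in gold else -1 for c in candidates]
    w, b = train_linear_svm(X, y)
    for c, x in zip(candidates, X):
        print(f"{c:12s} {score(w, b, x):+.2f}")
```

A realistic setup would train on many labeled documents and would likely use an established SVM implementation rather than this hand-written trainer; the sketch only shows how global and local features can feed the same classifier.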
