Learning to Extract Keyphrases from Text

Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. Keyphrases are useful for a surprisingly wide variety of tasks, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm, which we developed specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft's Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity's Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97).
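The supervised framing described above, in which a document becomes a set of phrases labeled as positive or negative examples, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `candidate_phrases` and `label_examples`, the `max_len` parameter, the use of plain word n-grams as candidates, and exact matching against an author-supplied list as the labeling rule. Feature extraction and the learner itself (C4.5 or GenEx) are left out.

```python
import re

def candidate_phrases(text, max_len=3):
    """Return the set of word n-grams (1..max_len words) that do not cross punctuation.

    Hypothetical helper: plain n-grams are an illustrative choice of candidates.
    """
    phrases = set()
    # Split on punctuation so that a phrase never spans a clause boundary.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        words = fragment.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrases.add(" ".join(words[i:i + n]))
    return phrases

def label_examples(text, author_keyphrases, max_len=3):
    """Map each candidate phrase to 1 (positive example) or 0 (negative example).

    Hypothetical helper: exact string match is an assumed labeling rule.
    """
    gold = {k.lower() for k in author_keyphrases}
    return {p: int(p in gold) for p in candidate_phrases(text, max_len)}

doc = ("Decision tree induction is a supervised learning method. "
       "Decision trees are easy to read.")
labels = label_examples(doc, ["decision tree induction", "supervised learning"])
positives = sorted(p for p, y in labels.items() if y == 1)
print(positives)
```

A classifier trained on such labeled phrases would then be asked to predict the label for the phrases of an unseen document. Negative examples greatly outnumber positive ones even in this two-sentence document.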
