Automatic extraction of document keyphrases for use in digital libraries: Evaluation and applications

This article describes an evaluation of the Kea automatic keyphrase extraction algorithm. Document keyphrases are conventionally used as concise descriptors of document content, and are increasingly used in novel ways, including document clustering, searching and browsing interfaces, and retrieval engines. However, it is costly and time-consuming to manually assign keyphrases to documents, motivating the development of tools that automatically perform this function. Previous studies have evaluated Kea's performance by measuring its ability to identify author keywords and keyphrases, but this methodology has a number of well-known limitations. The results presented in this article are based on evaluations by human assessors of the quality and appropriateness of Kea keyphrases. The results indicate that, in general, Kea produces keyphrases that are rated positively by human assessors. However, typical Kea settings can degrade performance, particularly those relating to keyphrase length and domain specificity. We found that for some settings, Kea's performance is better than that of similar systems, and that Kea's ranking of extracted keyphrases is effective. We also determined that author-specified keyphrases appear to exhibit an inherent ranking, and that they are rated highly and therefore suitable for use in training and evaluation of automatic keyphrasing systems.
