Automating keyphrase extraction with multi-objective genetic algorithms

Keyphrases are used extensively in information retrieval (IR) systems to facilitate information exchange, organize information, and assist retrieval. Automating keyphrase generation is essential for the timely creation of keyphrases for large repositories in new domains where no thesaurus yet exists, or for metacollections that need keyphrases meaningful across disparate collections. In this paper, we propose an automated keyphrase extraction algorithm based on a non-dominated sorting multi-objective genetic algorithm. The "clumping" property of keyphrases is used to judge the appropriateness of a phrase and is quantified by a condensation clustering measure proposed by Bookstein. The objective is to find the smallest phrase set with the best precision, as measured by average condensation clustering. Keyphrases were extracted from a collection of design conference papers, and the results were presented to domain experts for evaluation. Ninety percent of the generated phrases were deemed appropriate for use in a thesaurus for engineering design.
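
To make the two-objective trade-off concrete, the following is a minimal sketch of the non-dominated sorting step only, not the algorithm as implemented in the paper. It assumes each candidate phrase set has already been reduced to a hypothetical pair `(size, score)`, where `size` is the number of phrases (smaller is better) and `score` is a placeholder for average condensation clustering, rescaled here so that higher is better. The names `Candidate`, `dominates`, and `non_dominated_sort` are illustrative, and the scores are invented for the example rather than computed with Bookstein's measure.

```python
from typing import List, Tuple

# Hypothetical candidate: (phrase_set_size, clustering_score).
# Assumption: smaller size is better, higher score is better.
Candidate = Tuple[int, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    and strictly better on at least one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def non_dominated_sort(pop: List[Candidate]) -> List[List[int]]:
    """Split the population into successive Pareto fronts.

    Returns a list of fronts; each front is a sorted list of indices
    into pop. Front 0 holds the candidates that no other candidate
    dominates. This is the simple repeated-filter version, not the
    faster bookkeeping variant used in NSGA-II.
    """
    remaining = set(range(len(pop)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(pop[j], pop[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Invented example population of (size, score) pairs.
population = [(3, 0.9), (5, 0.95), (5, 0.7), (8, 0.95), (2, 0.5)]
fronts = non_dominated_sort(population)
```

In this example the first front contains the candidates at indices 0, 1, and 4, since none of them is beaten on both size and score at once. A full non-dominated sorting genetic algorithm would additionally need a diversity measure, selection, crossover, and mutation, none of which are sketched here.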
