A Comparative Study of Features for Keyphrase Extraction in Scientific Literature

Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is time-consuming, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Many different features have been proposed and used, individually or in combination. However, it is not clear which of these features are most informative for this task. We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new corpus that consists of full-text journal articles and is substantially larger than the data sets used in previous work. In addition, the rich collection and document structure available at the publishing stage is explicitly annotated. We suggest new features based on this structure and compare them to existing features, analyzing how the different features capture different aspects of the keyphrase extraction task.
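
To make the two baseline features named above concrete, the following minimal sketch computes relative first position and frequency for word n-gram candidates. It is an illustration only and not the feature extraction used in this study: the function name `candidate_features`, the regular-expression tokenizer, and the choice of n-grams up to length three as candidates are all assumptions made for the example.

```python
import re
from collections import Counter


def candidate_features(text, max_len=3):
    """Map each candidate phrase to (relative first position, frequency).

    Hypothetical helper for illustration: candidates are simply all
    word n-grams of up to max_len tokens.
    """
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    n = len(tokens)
    freq = Counter()
    first_pos = {}
    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            phrase = " ".join(tokens[i:i + size])
            freq[phrase] += 1
            # Record only the first occurrence, scaled to [0, 1).
            first_pos.setdefault(phrase, i / n)
    return {p: (first_pos[p], freq[p]) for p in freq}


features = candidate_features(
    "Keyphrase extraction finds keyphrases. "
    "Keyphrase extraction uses features."
)
print(features["keyphrase extraction"])
```

A phrase that appears early and often receives a small position value and a high frequency; a ranking or classification model would then combine these values with any further features.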
