The impact of document structure on keyphrase extraction

Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is time-consuming, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as their position or frequency in the document text. Document structure may contain useful information about which parts or phrases of a document are important, but it has rarely been considered as a source of information for keyphrase extraction. We address this gap in the context of keyphrase extraction from scientific literature. We introduce a new, large corpus of full-text journal articles in which the rich collection and document structure available at the publishing stage is explicitly annotated. We explore features based on the XML tags contained in the documents and features based on generic section types derived from position and from cue words in section titles. For XML tags, we find that sections, abstract, and title perform best, but many smaller elements may be beneficial in combination with other features. Among the generic section types, the discussion section proves most useful for keyphrase extraction.
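
As an illustration of how a generic section type might be derived from cue words in a section title together with the section's position, the following is a minimal sketch. It is not the procedure used in the paper: the cue-word lists, the function name `generic_section_type`, the set of section types, and the positional fallback rule are all assumptions made for this example.

```python
import re

# Hypothetical cue words per generic section type. These lists are
# assumptions for illustration; the abstract does not give the real ones.
CUE_WORDS = {
    "introduction": ("introduction", "background", "motivation"),
    "methods": ("method", "materials", "approach"),
    "results": ("result", "findings", "evaluation"),
    "discussion": ("discussion", "conclu"),
}


def generic_section_type(title, index, total):
    """Guess a generic section type from title cue words, then position.

    title: raw section title, e.g. "2. Materials and Methods"
    index: zero-based position of the section in the article
    total: number of top-level sections in the article
    """
    # Lowercase and replace digits and punctuation with spaces.
    normalized = re.sub(r"[^a-z ]", " ", title.lower())

    # Cue words take priority; the first matching type wins.
    for section_type, cues in CUE_WORDS.items():
        if any(cue in normalized for cue in cues):
            return section_type

    # Positional fallback (an assumption): an unlabelled first section is
    # treated as an introduction, an unlabelled last one as a discussion.
    if index == 0:
        return "introduction"
    if index == total - 1:
        return "discussion"
    return "other"


if __name__ == "__main__":
    titles = ["1. Introduction", "Materials and Methods",
              "Results", "Concluding remarks"]
    for i, t in enumerate(titles):
        print(t, "->", generic_section_type(t, i, len(titles)))
```

The resulting label could then serve as one feature of a candidate phrase, alongside position and frequency, by recording the section type in which the phrase occurs.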
