A Domain Independent Double Layered Approach to Keyphrase Generation

Annotating documents and web pages with semantic metadata can greatly increase the accuracy of Information Retrieval and Personalization systems, but the volume of available text is growing too large for extensive manual annotation. Automatic keyphrase generation, a complex task involving Natural Language Processing and Knowledge Engineering, can significantly support this activity. Several strategies have been proposed over the years, but most of them require extensive training data, which is not always available; suffer from high ambiguity and differences in writing style; are highly domain-specific; and often rely on well-structured knowledge that is very hard to acquire and encode. To overcome these limitations, we propose an innovative domain-independent approach that consists of an unsupervised keyphrase extraction phase followed by a keyphrase inference phase based on loosely structured, collaborative knowledge sources such as Wikipedia, Wordnik, and Urban Dictionary. This double-layered approach allows us to generate keyphrases that both describe and classify the text.
