A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language

Associating meaningful keyphrases with text documents and Web pages can significantly increase the accuracy of Information Retrieval, Personalization, and Recommender systems, but the growing amount of available text data is too large for extensive manual annotation. Automatic keyphrase generation can substantially support this activity. Several systems proposed in the literature already perform this task with satisfactory results; however, most of them focus solely on the English language, which represents more than 50% of Web content. Only a few other languages have been investigated, and Italian, despite being the ninth most used language on the Web, is not among them. To address this gap, we propose a novel multi-language, unsupervised, knowledge-based approach to keyphrase generation. To support our claims, we developed DIKpE-G, a prototype system that integrates several kinds of knowledge for selecting and evaluating meaningful keyphrases, ranging from linguistic to statistical, meta/structural, social, and ontological knowledge. DIKpE-G performs well over English and
