Simple Unsupervised Keyphrase Extraction using Sentence Embeddings

Keyphrase extraction is the task of automatically selecting a small set of phrases that best describe a given free text document. Supervised keyphrase extraction requires large amounts of labeled training data and generalizes very poorly outside the domain of the training data. At the same time, unsupervised systems have poor accuracy, and often do not generalize well, as they require the input document to belong to a larger corpus also given as input. Addressing these drawbacks, in this paper, we tackle keyphrase extraction from single documents with EmbedRank: a novel unsupervised method, that leverages sentence embeddings. EmbedRank achieves higher F-scores than graph-based state of the art systems on standard datasets and is suitable for real-time processing of large amounts of Web data. With EmbedRank, we also explicitly increase coverage and diversity among the selected keyphrases by introducing an embedding-based maximal marginal relevance (MMR) for new phrases. A user study including over 200 votes showed that, although reducing the phrases’ semantic overlap leads to no gains in F-score, our high diversity selection is preferred by humans.

[1]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[2]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[3]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[4]  Tatsunori Mori,et al.  Information Gain Ratio meets Maximal Marginal Relevance - A method of Summarization for Multiple Documents , 2002, NTCIR.

[5]  Anette Hulth,et al.  Improved Automatic Keyword Extraction Given More Linguistic Knowledge , 2003, EMNLP.

[6]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[7]  Min-Yen Kan,et al.  Keyphrase Extraction in Scientific Publications , 2007, ICADL.

[8]  Xiaojun Wan,et al.  Single Document Keyphrase Extraction Using Neighborhood Knowledge , 2008, AAAI.

[9]  Sreenivas Gollapudi,et al.  Diversifying search results , 2009, WSDM '09.

[10]  Zhiyuan Liu,et al.  Clustering to Find Exemplar Terms for Keyphrase Extraction , 2009, EMNLP.

[11]  Vincent Ng,et al.  Conundrums in Unsupervised Keyphrase Extraction: Making Sense of the State-of-the-Art , 2010, COLING.

[12]  Evaggelia Pitoura,et al.  Search result diversification , 2010, SGMD.

[13]  Yuli Ye,et al.  Max-Sum diversification, monotone submodular functions and dynamic updates , 2012, PODS '12.

[14]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[15]  Florian Boudin,et al.  TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction , 2013, IJCNLP.

[16]  Vincent Ng,et al.  Automatic Keyphrase Extraction: A Survey of the State of the Art , 2014, ACL.

[17]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[18]  Gianluca Demartini,et al.  Pooling-based continuous evaluation of information retrieval systems , 2015, Information Retrieval Journal.

[19]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[20]  Rui Wang Corpus-independent Generic Keyphrase Extraction Using Word Embedding Vectors , 2015 .

[21]  Timothy Baldwin,et al.  An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation , 2016, Rep4NLP@ACL.

[22]  Cornelia Caragea,et al.  A Position-Biased PageRank Algorithm for Keyphrase Extraction , 2017, AAAI.

[23]  Florian Boudin,et al.  Unsupervised Keyphrase Extraction with Multipartite Graphs , 2018, NAACL.

[24]  Matteo Pagliardini,et al.  Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features , 2017, NAACL.