Open Domain Web Keyphrase Extraction Beyond Language Modeling

This paper studies keyphrase extraction in real-world scenarios where documents come from diverse domains and vary in content quality. We curate and release OpenKP, a large-scale open-domain keyphrase extraction dataset with nearly one hundred thousand web documents and expert keyphrase annotations. To handle the variations in domain and content quality, we develop BLING-KPE, a neural keyphrase extraction model that goes beyond language understanding by using the visual presentation of documents and weak supervision from search queries. Experimental results on OpenKP confirm the effectiveness of BLING-KPE and the contributions of its neural architecture, visual features, and search log weak supervision. Zero-shot evaluations on DUC-2001 demonstrate that learning from open-domain data generalizes better than learning from a specific domain.
