Scientific Keyphrase Extraction: Extracting Candidates with Semi-supervised Data Augmentation

Keyphrase extraction provides an effective way of organizing scientific documents. For this task, neural methods often suffer from unstable performance due to data scarcity. In this paper, we adopt a two-step pipeline consisting of candidate extraction and keyphrase ranking, where candidate extraction is key to overall performance. In the candidate extraction step, to overcome the low recall of traditional rule-based methods, we propose a novel semi-supervised data augmentation method in which a neural tagging model and a discriminative classifier boost each other and select more confident phrases as candidates. With more reasonable candidates, keyphrases are identified with improved recall. Experiments on SemEval 2017 Task 10 show that our model achieves competitive results.
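
To make the candidate-augmentation loop concrete, the following is a minimal sketch of the idea described above: a tagging model proposes candidate phrases with confidences, a discriminative classifier re-scores them, and only phrases both models are confident about are added back as pseudo-labelled data for the next round. The stub classes, thresholds, and heuristics below are illustrative assumptions standing in for the neural tagger and the discriminative classifier; this is not the paper's implementation.

```python
# Minimal sketch of mutual bootstrapping between a tagger and a classifier.
# StubTagger and StubClassifier are hypothetical placeholders for the neural
# tagging model and the discriminative classifier; thresholds are assumed.

from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    phrase: str
    tagger_conf: float      # confidence assigned by the tagging model
    clf_conf: float = 0.0   # confidence assigned by the classifier


class StubTagger:
    """Stands in for a BiLSTM-CRF-style tagger; emits candidates with confidences."""

    def fit(self, labelled: List[str]) -> None:
        self.vocab = " ".join(labelled)

    def extract(self, docs: List[str]) -> List[Candidate]:
        out = []
        for doc in docs:
            for phrase in (p.strip() for p in doc.split(",")):
                # Toy heuristic: phrases sharing a token with known keyphrases
                # receive high confidence.
                overlap = any(tok in self.vocab for tok in phrase.split())
                out.append(Candidate(phrase, tagger_conf=0.95 if overlap else 0.4))
        return out


class StubClassifier:
    """Stands in for a phrase-level discriminative classifier."""

    def fit(self, labelled: List[str]) -> None:
        self.seen = set(labelled)

    def score(self, candidates: List[Candidate]) -> List[Candidate]:
        for c in candidates:
            # Toy heuristic: known or multi-word phrases score higher.
            c.clf_conf = 0.9 if (c.phrase in self.seen or len(c.phrase.split()) > 1) else 0.3
        return candidates


def augment(labelled: List[str], unlabelled_docs: List[str],
            tagger: StubTagger, clf: StubClassifier,
            rounds: int = 2, tau_tag: float = 0.9, tau_clf: float = 0.8) -> List[str]:
    """Grow the labelled phrase set with candidates both models agree on."""
    for _ in range(rounds):
        tagger.fit(labelled)
        clf.fit(labelled)
        candidates = clf.score(tagger.extract(unlabelled_docs))
        confident = [c.phrase for c in candidates
                     if c.tagger_conf >= tau_tag and c.clf_conf >= tau_clf]
        labelled = sorted(set(labelled) | set(confident))
    return labelled


if __name__ == "__main__":
    seed = ["keyphrase extraction", "sequence tagging"]
    docs = ["keyphrase ranking, graph-based method",
            "sequence tagging, conditional random field"]
    print(augment(seed, docs, StubTagger(), StubClassifier()))
```

In the actual pipeline, the augmented candidate set produced by such a loop would feed the keyphrase ranking step; the agreement thresholds control how aggressively pseudo-labelled phrases are added.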
