Design and Implementation of Keyphrase Extraction Engine for Chinese Scientific Literature

Accurate keyphrases summarize the main topics, which are important for information retrieval and many other natural language processing tasks. In this paper, we construct a keyphrase extraction engine for Chinese scientific literature to assist researchers in improving the efficiency of scientific research. There are four key technical problems in the process of building the engine: how to select a keyphrase extraction algorithm, how to build a large-scale training set to achieve application-level performance, how to adjust and optimize the model to achieve better application results, and how to be conveniently invoked by researchers. Aiming at the above problems, we propose corresponding solutions. The engine is able to automatically recommend four to five keyphrases for the Chinese scientific abstracts given by the user, and the response speed is generally within 3 seconds. The keyphrase extraction engine for Chinese scientific literature is developed based on advanced deep learning algorithms, large-scale training set, and high-performance computing capacity, which might be an effective tool for ∗Corresponding Author Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. EEKE2021, September 27-30, 2021, Illinois, USA © 2021 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn researchers and publishers to quickly capture the key stating points of scientific text.

[1]  Mark S. Staveley,et al.  Phrasier: a system for interactive document retrieval using keyphrases , 1999, SIGIR '99.

[2]  Jie Li,et al.  Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling , 2021, EEKE@JCDL.

[3]  Jing Tang,et al.  Keyphrase Extraction with Span-based Feature Representations , 2020, ArXiv.

[4]  Dan Roth,et al.  Design Challenges and Misconceptions in Named Entity Recognition , 2009, CoNLL.

[5]  Iz Beltagy,et al.  SciBERT: A Pretrained Language Model for Scientific Text , 2019, EMNLP.

[6]  Evangelos E. Milios,et al.  World Wide Web site summarization , 2004, Web Intell. Agent Syst..

[7]  Mitchell P. Marcus,et al.  Text Chunking using Transformation-Based Learning , 1995, VLC@ACL.

[8]  Gábor Berend,et al.  Opinion Expression Mining by Exploiting Keyphrase Extraction , 2011, IJCNLP.

[9]  Xuanjing Huang,et al.  Keyphrase Extraction Using Deep Recurrent Neural Networks on Twitter , 2016, EMNLP.

[10]  Jian Zhang,et al.  SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[11]  Anette Hulth,et al.  A Study on Automatically Extracted Keywords in Text Categorization , 2006, ACL.

[12]  Chengzhi Zhang,et al.  Automatic Keyword Extraction from Documents Using Conditional Random Fields , 2008 .

[13]  Clement T. Yu,et al.  A theory of term importance in automatic text analysis , 1974, J. Am. Soc. Inf. Sci..

[14]  Roger Zimmermann,et al.  Keyphrase Extraction from Scholarly Articles as Sequence Labeling using Contextualized Embeddings , 2019, ArXiv.

[15]  Anette Hulth,et al.  Improved Automatic Keyword Extraction Given More Linguistic Knowledge , 2003, EMNLP.

[16]  Jaewoo Kang,et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..

[17]  Peter D. Turney Learning Algorithms for Keyphrase Extraction , 2000, Information Retrieval.

[18]  Yi-fang Brook Wu,et al.  Domain-specific keyphrase extraction , 2005, CIKM '05.

[19]  David Yarowsky,et al.  Techniques in Speech Acoustics , 1999, Computational Linguistics.

[20]  Huan Zhang,et al.  Chinese clinical named entity recognition with variant neural structures based on BERT methods , 2020, J. Biomed. Informatics.

[21]  Chin-Yew Lin,et al.  Towards Improving Neural Named Entity Recognition with Gazetteers , 2019, ACL.

[22]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.