Title Generation and Keyphrase Extraction from Persian Scientific Texts

Modern neural approaches, which typically rely on large volumes of training data, have achieved remarkable progress in many areas of text processing. However, they remain under-studied for low-resource languages. In this paper we focus on title generation and keyphrase extraction in Persian. We build a large corpus of Persian scientific texts, which enables us to train end-to-end neural models for generating titles and extracting keyphrases. We investigate the effect of input length on modeling Persian text in both tasks. We also compare subword-level processing with word-level processing and show that, because Persian is an agglutinative language, even a straightforward subword encoding method greatly improves results. For keyphrase extraction we formulate the task in two ways: training the model to output all keyphrases at once, or training it to output one keyphrase at a time and extracting the n-best keyphrases during decoding. The latter formulation improves performance substantially.
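The abstract does not name the subword encoding method used; a common choice for this kind of segmentation is byte-pair encoding (Sennrich et al., 2016), which starts from characters and greedily merges the most frequent adjacent symbol pair. A minimal sketch of learning BPE merges from word frequencies, under that assumption (all function names here are illustrative, not the paper's code):

```python
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge(symbols, pair):
    # Replace each occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    # Start from character-level symbols and greedily apply the most
    # frequent merge; the learned merge list defines the subword vocabulary.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        merges.append(best)
        vocab = {merge(s, best): f for s, f in vocab.items()}
    return merges, vocab
```

For an agglutinative language such as Persian, merges learned this way tend to isolate productive affixes as reusable subword units, which is one plausible reason the abstract reports large gains over word-level processing.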
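The two keyphrase-extraction formulations differ only in how training targets are built for the seq2seq model. A minimal sketch of that data preparation, assuming a simple separator-joined target for the all-at-once variant (the helper names and separator token are assumptions, not the paper's API):

```python
def one2seq_targets(samples, sep=" ; "):
    # One training pair per document: all gold keyphrases are concatenated
    # into a single target sequence, separated by a special token.
    return [(src, sep.join(kps)) for src, kps in samples]

def one2one_targets(samples):
    # One training pair per keyphrase. At test time the model is decoded
    # once per document, and the n-best hypotheses (e.g. the top-n beam
    # search outputs) are taken as the n candidate keyphrases.
    return [(src, kp) for src, kps in samples for kp in kps]
```

Under the one-keyphrase-per-target formulation the model never has to learn an ordering over keyphrases, and the number of extracted phrases is controlled at decoding time via the beam size, which is consistent with the abstract's report that this variant performs better.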
