Identifying High Quality Document-Summary Pairs through Text Matching

Text summarization, namely automatically generating a short summary of a given document, is a difficult task in natural language processing. Deep learning has gradually been applied to text summarization, but large-scale, high-quality datasets for training such models are still lacking. In this paper, we proposed a novel deep learning method to identify high-quality document-summary pairs for building a large-scale dataset of such pairs. Concretely, a long short-term memory (LSTM)-based model was designed to measure the quality of document-summary pairs. To leverage information across all parts of each document, we further proposed an improved LSTM-based model in which the forget gate of the LSTM unit is removed. Experiments on training and test sets built from Sina Weibo (a Chinese microblog website similar to Twitter) showed that the LSTM-based models significantly outperformed baseline models in terms of the area under the receiver operating characteristic curve (AUC).
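The abstract gives no equations, so the following is only a minimal sketch of what "removing the forget gate" could look like, not the authors' implementation. It assumes the standard LSTM cell update c = f * c_prev + i * g and simply drops the f term, so earlier cell content is never scaled down. The function names, the random initialization, the cosine-similarity matching score, and the pairwise AUC helper are all illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def affine(W, b, v):
    """Return W @ v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initialization: small Gaussian weights, zero biases.

    Only three weight sets exist (input gate i, output gate o, candidate g);
    there is deliberately no forget-gate entry.
    """
    rng = random.Random(seed)
    cols = input_dim + hidden_dim

    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(hidden_dim)]

    return {name: (mat(), [0.0] * hidden_dim) for name in ("i", "o", "g")}


def lstm_step_no_forget(x, h_prev, c_prev, params):
    """One LSTM step with the forget gate removed: c = c_prev + i * g."""
    v = list(x) + list(h_prev)
    i = [sigmoid(z) for z in affine(*params["i"], v)]
    o = [sigmoid(z) for z in affine(*params["o"], v)]
    g = [math.tanh(z) for z in affine(*params["g"], v)]
    c = [cp + ii * gg for cp, ii, gg in zip(c_prev, i, g)]
    h = [oo * math.tanh(cc) for oo, cc in zip(o, c)]
    return h, c


def encode(seq, params, hidden_dim):
    """Run the cell over a sequence of input vectors; return the last hidden state."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    for x in seq:
        h, c = lstm_step_no_forget(x, h, c, params)
    return h


def match_score(doc, summary, params, hidden_dim):
    """Assumed scoring rule: cosine similarity of the two encodings."""
    a = encode(doc, params, hidden_dim)
    b = encode(summary, params, hidden_dim)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (good pair) or 0 (bad pair).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a usage example, `auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` returns 0.75, because three of the four (positive, negative) score pairs are ordered correctly. In practice the weights would be learned from labeled pairs and the inputs would be word embeddings; both are outside the scope of this sketch.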
