Identifying High Quality Document-Summary Pairs through Text Matching

Text summarization, namely automatically generating a short summary of a given document, is a difficult task in natural language processing. Deep learning has increasingly been applied to text summarization, but large-scale, high-quality datasets for training such models are still lacking. In this paper, we propose a novel deep learning method to identify high-quality document–summary pairs for building a large-scale dataset of such pairs. Concretely, we design a long short-term memory (LSTM)-based model to measure the quality of document–summary pairs. To leverage information across all parts of each document, we further propose an improved LSTM-based model that removes the forget gate from the LSTM unit. Experiments on training and test sets built from Sina Weibo (a Chinese microblog website similar to Twitter) show that the LSTM-based models significantly outperform baseline models in terms of the area under the receiver operating characteristic curve (AUC).
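As a rough illustration of the evaluation metric named above, the following minimal sketch computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. It is not the authors' evaluation code: the function name `auc`, the label convention (1 for a high-quality pair, 0 otherwise), and the example scores are all assumptions made for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a high-quality document-summary pair, 0 otherwise
            (an assumed convention, not taken from the paper).
    scores: model-assigned quality scores, higher meaning better.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, invented for this example.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of examples, so it suits small checks only; for large test sets a sort-based computation or a library routine would be the usual choice.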
