Optimizing Sentence Modeling and Selection for Document Summarization

Extractive document summarization aims to conclude given documents by extracting some salient sentences. Often, it faces two challenges: 1) how to model the information redundancy among candidate sentences; 2) how to select the most appropriate sentences. This paper attempts to build a strong summarizer DivSelect+CNNLM by presenting new algorithms to optimize each of them. Concretely, it proposes CNNLM, a novel neural network language model (NNLM) based on convolutional neural network (CNN), to project sentences into dense distributed representations, then models sentence redundancy by cosine similarity. Afterwards, it formulates the selection process as an optimization problem, constructing a diversified selection process (DivSelect) with the aim of selecting some sentences which have high prestige, meantime, are dissimilar with each other. Experimental results on DUC2002 and DUC2004 benchmark data sets demonstrate the effectiveness of our approach.

[1]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[2]  Wenpeng Yin,et al.  Automatic Multi-document Summarization Based on New Sentence Similarity Measures , 2012, PRICAI.

[3]  Ramiz M. Aliguliyev A Novel Partitioning-Based Clustering Method and Generic Document Summarization , 2006, 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops.

[4]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[5]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[6]  Quoc V. Le,et al.  Distributed Representations of Sentences and Documents , 2014, ICML.

[7]  Dragomir R. Radev,et al.  DivRank: the interplay of prestige and diversity in information networks , 2010, KDD.

[8]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[9]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[10]  Uriel Feige,et al.  The Dense k -Subgraph Problem , 2001, Algorithmica.

[11]  Phil Blunsom,et al.  A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[12]  Xueqi Cheng,et al.  Decayed DivRank: capturing relevance, diversity and prestige in information networks , 2011, SIGIR '11.

[13]  Angela Repanovici,et al.  Expert Systems with Applications in the Legal Domain , 2015 .

[14]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[15]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[16]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[17]  Wenpeng Yin,et al.  Generic Multi-Document Summarization Using Topic-Oriented Information , 2012, PRICAI.

[18]  Xiaojun Wan,et al.  Towards an Iterative Reinforcement Approach for Simultaneous Document Summarization and Keyword Extraction , 2007, ACL.

[19]  Fan Zhang,et al.  SentTopic-MultiRank: a Novel Ranking Model for Multi-Document Summarization , 2012, COLING.

[20]  Jingrui He,et al.  GenDeR: A Generic Diversified Ranking Algorithm , 2012, NIPS.

[21]  Paul M. B. Vitányi,et al.  The Google Similarity Distance , 2004, IEEE Transactions on Knowledge and Data Engineering.

[22]  Xiaojun Wan,et al.  Multi-document summarization using cluster-based link analysis , 2008, SIGIR '08.

[23]  Xiaojun Wan Document-Based HITS Model for Multi-document Summarization , 2008, PRICAI.

[24]  Shafiq R. Joty,et al.  Unsupervised Approach for Selecting Sentences in Query-based Summarization , 2008, FLAIRS Conference.

[25]  M. L. Fisher,et al.  An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program..

[26]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[27]  Dragomir R. Radev,et al.  LexPageRank: Prestige in Multi-Document Text Summarization , 2004, EMNLP.

[28]  Ramiz M. Aliguliyev,et al.  A new sentence similarity measure and sentence based extractive technique for automatic text summarization , 2009, Expert Syst. Appl..

[29]  Georgiana Dinu,et al.  Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors , 2014, ACL.

[30]  Chris H. Q. Ding,et al.  Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization , 2008, SIGIR '08.

[31]  Yee Whye Teh,et al.  A fast and simple algorithm for training neural probabilistic language models , 2012, ICML.

[32]  Jingrui He,et al.  Diversified ranking on large graphs: an optimization viewpoint , 2011, KDD.

[33]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.