Incorporating word embeddings in the hierarchical Dirichlet process for query-oriented text summarization

The ever-growing amount of textual data available online creates a need for automatic text summarization tools. Probabilistic topic models can infer semantic relationships between sentences, which is a key step in extractive summarization. However, they rely heavily on word co-occurrence patterns and fail to capture semantic relationships between the words themselves, such as synonymy and antonymy. We propose a novel algorithm that incorporates pre-trained word embeddings into a probabilistic topic model, the hierarchical Dirichlet process, in order to capture semantic similarities between sentences. These similarities provide the basis for a sentence ranking algorithm for query-oriented summarization. The summary is then produced by extracting the highest-ranked sentences from the original corpus. We show that our method outperforms state-of-the-art algorithms on a benchmark dataset.
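The rank-then-extract step described above can be illustrated with a small sketch. It is not the proposed algorithm: the topic model is replaced by plain cosine similarity between sentence vectors that are assumed to be given (for example, averaged word embeddings), and the ranking is a generic query-biased random walk. The names `rank_sentences`, `summarize`, `walk_weight`, and `k` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_sentences(sentence_vecs, query_vec, walk_weight=0.85, iters=50):
    """Score sentences with a random walk that restarts at query-relevant sentences.

    Assumption: similarities come from cosine on given vectors, standing in
    for the topic-model-based similarities used in the paper.
    """
    n = len(sentence_vecs)
    # Restart distribution: non-negative similarity to the query, normalised.
    rel = [max(cosine(s, query_vec), 0.0) for s in sentence_vecs]
    total = sum(rel)
    prior = [r / total for r in rel] if total else [1.0 / n] * n
    # Row-normalised sentence-to-sentence similarities, without self-loops.
    trans = []
    for i in range(n):
        row = [max(cosine(sentence_vecs[i], sentence_vecs[j]), 0.0) if i != j else 0.0
               for j in range(n)]
        row_sum = sum(row)
        trans.append([x / row_sum for x in row] if row_sum else prior[:])
    # Power iteration on the biased walk.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - walk_weight) * prior[j]
                  + walk_weight * sum(scores[i] * trans[i][j] for i in range(n))
                  for j in range(n)]
    return scores


def summarize(sentences, sentence_vecs, query_vec, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentence_vecs, query_vec)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, sentence_vecs, query_vec, k=2)` with one vector per sentence returns the two sentences that are both close to the query and well connected to other relevant sentences. A fuller implementation would also need a redundancy control and a length budget, neither of which is specified in the abstract.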
