Combining a mixture language model and Naive Bayes for multi-document summarisation

The TNO system for multi-document summarisation is based on an extraction approach. We combined two statistical methods for sentence selection with a variant of the MMR algorithm. After sentence segmentation, each sentence is scored by two probabilistic models. The first model scores sentences with a generative unigram language model, which is a mixture of a cluster model, a document model and a background model; this score is compared to the probability that the sentence is generated by the background model alone. The resulting log-likelihood ratio is normalised by sentence length. The second model is a simple Bayesian model based on several non-content sentence features: sentence position, sentence length and cue phrases. Each model yields a likelihood ratio score, and the two scores are combined to give a more reliable salience score. Finally, the summary is constructed incrementally: the most salient sentence is selected first, and further sentences are added only if they are both salient and contribute new information.
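
The pipeline described above can be illustrated with a minimal Python sketch. It is not the TNO implementation: the mixture weights, the floor probability for unseen background words, the feature probability tables, the additive combination of the two scores, the Jaccard redundancy measure and the trade-off parameter `lam` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math


def lm_score(sentence, p_cluster, p_doc, p_bg, weights=(0.4, 0.3, 0.3)):
    """Length-normalised log-likelihood ratio of a sentence:
    mixture (cluster + document + background) versus background alone.
    The weights and the 1e-6 floor are illustrative assumptions."""
    w_cluster, w_doc, w_bg = weights
    llr = 0.0
    for word in sentence:
        bg = p_bg.get(word, 1e-6)
        mix = (w_cluster * p_cluster.get(word, 0.0)
               + w_doc * p_doc.get(word, 0.0)
               + w_bg * bg)
        llr += math.log(mix / bg)
    return llr / max(len(sentence), 1)


def feature_score(features, p_in, p_out):
    """Naive Bayes log-likelihood ratio over non-content features
    (e.g. position bucket, length bucket, cue phrase present).
    p_in / p_out are hypothetical tables of P(feature | in summary)
    and P(feature | not in summary)."""
    return sum(math.log(p_in[f] / p_out[f]) for f in features)


def salience(lm, feat):
    """Assumed combination: the two log ratios are simply added."""
    return lm + feat


def jaccard(a, b):
    """Word-overlap similarity, used here as a stand-in redundancy measure."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def select(sentences, scores, k, lam=0.7):
    """Greedy MMR-style selection: trade salience off against the
    maximum similarity to any sentence already chosen."""
    chosen = []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < k:
        def mmr(i):
            redundancy = max(
                (jaccard(sentences[i], sentences[j]) for j in chosen),
                default=0.0,
            )
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because nothing has been chosen on the first iteration, the redundancy term is zero and the most salient sentence is picked first; later iterations penalise sentences that overlap heavily with the summary so far. A sentence is passed as a list of tokens, and `select` returns the indices of the chosen sentences in selection order.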