Maximum lexical cohesion for fine-grained news story segmentation

We propose a maximum lexical cohesion (MLC) approach to news story segmentation. Unlike sentence-dependent lexical methods, our approach can detect story boundaries at finer word/subword granularity, and is thus better suited to speech recognition transcripts, which lack sentence delimiters. The proposed segmentation goodness measure accounts for both lexical cohesion and a prior preference on story length. We measure the lexical cohesion of a segment by the KL-divergence from its word distribution to an associated piecewise uniform distribution. To account for the uneven contributions of different words to a story, the cohesion measure is further refined with two word weighting schemes: the inverse document frequency (IDF) and a new weighting method called difference from expectation (DFE). We then propose a dynamic programming solution that exactly maximizes the segmentation goodness and efficiently locates story boundaries in polynomial time. Experimental results show that our MLC approach outperforms several state-of-the-art lexical methods.

Index Terms: story segmentation, KL-divergence, lexical cohesion, word weighting, dynamic programming, spoken document segmentation, spoken document retrieval
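To make the abstract's dynamic-programming formulation concrete, the following is a minimal sketch, not the paper's exact method: it assumes a KL-based cohesion score against a uniform reference distribution over the vocabulary (a stand-in for the piecewise uniform distribution described above), a Gaussian log-prior on story length, and no IDF/DFE word weighting. All function names and parameters (`kl_cohesion`, `length_prior`, `mean_len`, `max_len`) are illustrative assumptions.

```python
# Illustrative sketch only: the reference distribution, the length prior and
# the absence of word weighting are simplifying assumptions, not the authors'
# exact segmentation goodness measure.
import math
from collections import Counter


def kl_cohesion(tokens, vocab_size):
    """KL divergence from the segment's empirical word distribution to a
    uniform distribution over the vocabulary (assumed reference)."""
    counts = Counter(tokens)
    n = len(tokens)
    q = 1.0 / vocab_size  # uniform reference probability
    return sum((c / n) * math.log((c / n) / q) for c in counts.values())


def length_prior(length, mean_len=200.0, std_len=80.0):
    """Gaussian log-prior on story length (assumed form)."""
    return -0.5 * ((length - mean_len) / std_len) ** 2


def segment(tokens, vocab_size, max_len=600):
    """Dynamic program: best[j] is the best total goodness of the first j
    tokens; back[j] stores the start of the last segment ending at j."""
    n = len(tokens)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            score = (best[i]
                     + kl_cohesion(tokens[i:j], vocab_size)
                     + length_prior(j - i))
            if score > best[j]:
                best[j], back[j] = score, i
    # Recover story boundaries by walking the back-pointers.
    bounds, j = [], n
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)
```

Because the objective decomposes over segments, the inner loop needs only the best prefix scores, so boundaries are found exactly in polynomial time; capping segment length at `max_len` keeps the sketch at roughly O(n · max_len) boundary evaluations.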