Story segmentation for speech transcripts in sparse data conditions

Information retrieval systems determine relevance by comparing information needs with the content of potential retrieval units. Unlike most textual data, automatically generated speech transcripts lack explicit structural markers and therefore cannot easily be divided into obvious retrieval units. This problem can be addressed by automatically detecting topically cohesive segments, or stories. However, when the collection consists of speech from less formal domains than broadcast news, most standard boundary detection methods are potentially unsuitable because they rely on learned features; for conversational speech in particular, the lack of adequate training data is a significant obstacle. In this paper, four methods for the automatic segmentation of speech transcripts are compared. They were selected for their independence from collection-specific knowledge and implemented without the use of training data. Two of the four methods are based on existing algorithms; the others are novel approaches based on a query-informed dynamic segmentation algorithm (QDSA) and on WordNet. Experiments were carried out on a task similar to the TREC SDR unknown-boundary condition. For the best-performing system, QDSA, retrieval scores with a tf-idf-type ranking function were equivalent to those obtained with a reference segmentation, and improved further when document length normalization was applied using the BM25/Okapi method. We conclude that for the task of automatically segmenting speech transcripts for information retrieval, a training-poor processing paradigm, which can be crucial for handling surprise data, is feasible.
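The training-free segmentation methods discussed above build on lexical-cohesion ideas in the TextTiling tradition: score each candidate boundary by how dissimilar the word distributions on its two sides are, and cut where the similarity dips. The sketch below illustrates that general idea only; it is not the paper's QDSA algorithm, and all function names, block sizes, and the depth threshold are illustrative assumptions.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bags of words (Counter objects).
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cohesion_boundaries(tokens, block=20, window=2, threshold=0.5):
    # Split the transcript into fixed-size token blocks (no sentence or
    # paragraph markers are assumed, as in ASR output).
    blocks = [Counter(tokens[i:i + block]) for i in range(0, len(tokens), block)]
    # Similarity of the material just before vs. just after each gap.
    sims = []
    for i in range(1, len(blocks)):
        left, right = Counter(), Counter()
        for b in blocks[max(0, i - window):i]:
            left.update(b)
        for b in blocks[i:i + window]:
            right.update(b)
        sims.append(cosine(left, right))
    # Depth score: how deep each similarity valley is relative to its
    # neighbours; deep valleys are hypothesized story boundaries.
    boundaries = []
    for i in range(1, len(sims) - 1):
        depth = (sims[i - 1] - sims[i]) + (sims[i + 1] - sims[i])
        if depth > threshold:
            boundaries.append((i + 1) * block)  # token offset of the cut
    return boundaries
```

On an artificial transcript whose vocabulary changes abruptly, `cohesion_boundaries(["cat"] * 60 + ["dog"] * 60)` places a single boundary at token 60. Because the method needs only the transcript itself, it matches the paper's requirement of independence from collection-specific knowledge and training data.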
