We describe our runs and the results obtained for the "IR for Spoken Documents (SpokenDoc) Task" at NTCIR-9. Our participation focused on using segmentation methods to divide the manual and ASR transcripts into topically coherent segments. The underlying assumption is that these segments will capture the passages in a transcript that are relevant to the query. Our experiments investigate two lexical-coherence-based segmentation algorithms, TextTiling and C99, run on the provided manual and ASR transcripts and on the ASR transcript with stop words removed. Evaluation shows that TextTiling consistently outperforms C99, both in segmenting the data into retrieval units, as evaluated using the centre-located relevant information metric, and in the content precision of each automatically created segment.