Paper information - EXPERIMENTS IN INFORMATION RETRIEVAL FROM SPOKEN DOCUMENTS

EXPERIMENTS IN INFORMATION RETRIEVAL FROM SPOKEN DOCUMENTS

This paper describes experiments performed as part of the TREC-97 Spoken Document Retrieval (SDR) Track. The task was to pick the correct document from 35 hours of recognized speech documents, based on a text query describing exactly one document. The experiments described here include: vocabulary size experiments to assess the effect of words missing from the speech recognition vocabulary; speech recognition using a stemmed language model; the use of confidence annotations that estimate the correctness of each recognized word; and the use of multiple hypotheses from the recognizer. Finally, we measured the effect of corpus size on the SDR task. Despite fairly high word error rates, information retrieval performance was only slightly degraded for documents transcribed by the speech recognizer.
