On the Use of Automatic Speech Recognition for Spoken Information Retrieval from Video Databases

This document describes the realization of a spoken information retrieval system and its application to word search in an indexed video database. The system uses automatic speech recognition (ASR) software to convert the audio track of a video file into a transcript, which is then indexed by a document indexing tool. A spoken query, uttered by any user, is decoded by the ASR system into a hypothesis that is used to formulate a query against the indexed database. The final output of the system is a list of video frame tags whose audio corresponds to the spoken query. The speech recognition system achieved a Word Error Rate (WER) below 15%, and its combined operation with the document indexing system showed outstanding performance with spoken queries.
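The following is a minimal sketch of the pipeline described above: transcripts produced by ASR are indexed by word against video frame tags, and a decoded spoken query is matched against that index. The decoder is represented by a placeholder (`decode_audio`), since the paper does not specify the recognizer's API; all names, file paths, and the toy inverted index are assumptions for illustration only, not the authors' implementation.

```python
# Sketch of the spoken-query video retrieval pipeline (illustrative, not the paper's code).
from collections import defaultdict
from typing import Dict, List, Tuple

# A transcript is a list of (word, frame_tag) pairs produced by the ASR system,
# where frame_tag identifies the video frame at which the word was spoken.
Transcript = List[Tuple[str, str]]


def decode_audio(audio_path: str) -> List[str]:
    """Placeholder for the ASR decoder: returns the hypothesized word sequence.

    A real system would call the speech recognizer here; this stub stands in
    for that step so the rest of the pipeline can be shown end to end.
    """
    # Hypothetical output for a spoken query such as "neural network".
    return ["neural", "network"]


def build_index(transcripts: Dict[str, Transcript]) -> Dict[str, List[Tuple[str, str]]]:
    """Build an inverted index: word -> list of (video_id, frame_tag)."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for video_id, transcript in transcripts.items():
        for word, frame_tag in transcript:
            index[word.lower()].append((video_id, frame_tag))
    return index


def search(index: Dict[str, List[Tuple[str, str]]], query_words: List[str]) -> List[Tuple[str, str]]:
    """Return (video_id, frame_tag) pairs whose transcript words match the decoded query."""
    hits: List[Tuple[str, str]] = []
    for word in query_words:
        hits.extend(index.get(word.lower(), []))
    return hits


if __name__ == "__main__":
    # Toy transcripts standing in for ASR output over an indexed video database.
    transcripts = {
        "lecture01.mp4": [("neural", "frame_0120"), ("network", "frame_0121")],
        "lecture02.mp4": [("speech", "frame_0015"), ("recognition", "frame_0016")],
    }
    index = build_index(transcripts)
    query = decode_audio("spoken_query.wav")  # hypothetical audio file
    print(search(index, query))
```

In a full system the frame tags returned by the search step would point back into the video timeline, so the user can jump directly to the segments where the queried words were spoken.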
