On the Use of Automatic Speech Recognition for Spoken Information Retrieval from Video Databases

This document describes the realization of a spoken information retrieval system and its application to word search in an indexed video database. The system uses automatic speech recognition (ASR) software to convert the audio track of a video file into a transcript, which is then indexed by a document indexing tool. A spoken query, uttered by any user, is decoded by the ASR system into a hypothesis that is used to formulate a query against the indexed database. The final output of the system is a list of video frame tags whose audio corresponds to the spoken query. The speech recognition system achieved a Word Error Rate (WER) below 15%, and its combined operation with the document indexing system showed outstanding performance with spoken queries.
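The following is a minimal sketch of the pipeline described above: transcripts produced by ASR are indexed by word against video frame tags, and a decoded spoken query is matched against that index. The decoder is represented by a placeholder (`decode_audio`), since the paper does not specify the recognizer's API; all names, file paths, and the toy inverted index are assumptions for illustration only, not the authors' implementation.

```python
# Sketch of the spoken-query video retrieval pipeline (illustrative, not the paper's code).
from collections import defaultdict
from typing import Dict, List, Tuple

# A transcript is a list of (word, frame_tag) pairs produced by the ASR system,
# where frame_tag identifies the video frame at which the word was spoken.
Transcript = List[Tuple[str, str]]


def decode_audio(audio_path: str) -> List[str]:
    """Placeholder for the ASR decoder: returns the hypothesized word sequence.

    A real system would call the speech recognizer here; this stub stands in
    for that step so the rest of the pipeline can be shown end to end.
    """
    # Hypothetical output for a spoken query such as "neural network".
    return ["neural", "network"]


def build_index(transcripts: Dict[str, Transcript]) -> Dict[str, List[Tuple[str, str]]]:
    """Build an inverted index: word -> list of (video_id, frame_tag)."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for video_id, transcript in transcripts.items():
        for word, frame_tag in transcript:
            index[word.lower()].append((video_id, frame_tag))
    return index


def search(index: Dict[str, List[Tuple[str, str]]], query_words: List[str]) -> List[Tuple[str, str]]:
    """Return (video_id, frame_tag) pairs whose transcript words match the decoded query."""
    hits: List[Tuple[str, str]] = []
    for word in query_words:
        hits.extend(index.get(word.lower(), []))
    return hits


if __name__ == "__main__":
    # Toy transcripts standing in for ASR output over an indexed video database.
    transcripts = {
        "lecture01.mp4": [("neural", "frame_0120"), ("network", "frame_0121")],
        "lecture02.mp4": [("speech", "frame_0015"), ("recognition", "frame_0016")],
    }
    index = build_index(transcripts)
    query = decode_audio("spoken_query.wav")  # hypothetical audio file
    print(search(index, query))
```

In a full system the frame tags returned by the search step would point back into the video timeline, so the user can jump directly to the segments where the queried words were spoken.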
