Multimedia document retrieval using speech and speaker recognition

Abstract. Speech and speaker recognition systems are rapidly being deployed in real-world applications. In this paper, we describe a system, and its components, for indexing and retrieving multimedia content derived from broadcast news sources. The audio analysis component combines real-time speech recognition, which converts the audio to text, with concurrent speaker analysis, which segments the audio into acoustically homogeneous sections and then performs speaker identification on them. Statistics derived from the output of these two simultaneous processes are used to build indexes for text-based and speaker-based retrieval automatically, without user intervention. The real power of multimedia document processing lies in Boolean queries that combine text-based and speaker-based terms. Retrieval for such queries entails combining the results of the individual text-based and speaker-based searches. The underlying techniques discussed here can easily be extended to other speech-centric applications and transactions.
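As a rough illustration of how the results of separate text-based and speaker-based searches might be combined for a Boolean query, consider the minimal Python sketch below. It is not the system's actual method, which the abstract does not specify. The `Hit` record, the time-overlap rule for AND, and the function names `combine_and` and `combine_or` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search result: a time span inside one audio document."""
    doc_id: str
    start: float  # seconds from the beginning of the document
    end: float    # seconds from the beginning of the document


def overlaps(a: Hit, b: Hit) -> bool:
    """True if both hits are in the same document and their spans intersect."""
    return a.doc_id == b.doc_id and a.start < b.end and b.start < a.end


def combine_and(text_hits, speaker_hits):
    """Assumed AND rule: keep text hits spoken during a matching speaker segment."""
    return [t for t in text_hits if any(overlaps(t, s) for s in speaker_hits)]


def combine_or(text_hits, speaker_hits):
    """Assumed OR rule: union of both result lists, duplicates removed, order kept."""
    seen = set()
    merged = []
    for hit in list(text_hits) + list(speaker_hits):
        if hit not in seen:
            seen.add(hit)
            merged.append(hit)
    return merged


# Made-up results for a query such as: text "election" AND speaker "anchor A".
text_hits = [
    Hit("news1", 10.0, 12.0),
    Hit("news1", 95.0, 97.0),
    Hit("news2", 5.0, 6.0),
]
speaker_hits = [
    Hit("news1", 0.0, 30.0),
    Hit("news2", 40.0, 60.0),
]

both = combine_and(text_hits, speaker_hits)
either = combine_or(text_hits, speaker_hits)
```

With this made-up data, only the text hit that falls inside a segment attributed to the queried speaker survives the AND. A real system would presumably also rank the combined results, which this sketch leaves out.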