Frequent word section extraction in a presentation speech by an effective dynamic programming algorithm.

Word frequency in a document is widely used in text search and summarization. Identifying frequent words or phrases in speech data would be similarly useful for searching and summarizing speech. Obtaining word frequencies from speech is difficult, however, because the frequent words are often specialized terms that a general-purpose speech recognizer cannot recognize. This paper proposes an alternative approach that automatically extracts such frequent word sections from speech data. The proposed method is applicable to monologue speech in any domain, because it requires no language model or predefined terms. The extracted sections can be regarded as a kind of speech label or as a digest of the presentation. Frequent word sections are determined by detecting similar sections, that is, sections of audio data that represent the same word or phrase. Similar sections are detected by an efficient algorithm called Shift Continuous Dynamic Programming (Shift CDP), which performs fast matching between arbitrary sections of the reference speech pattern and arbitrary sections of the input speech, and which enables frame-synchronous extraction of similar sections. In experiments, the algorithm is applied to extract repeated sections from oral presentations recorded at academic conferences in Japan. The results show that Shift CDP successfully detects similar sections and identifies the frequent word sections in individual presentations without prior domain knowledge such as language models or terms.
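
The idea of matching short reference sections against a stream of input frames can be illustrated with a small sketch. The code below is not the authors' Shift CDP; it is a simplified stand-in that assumes fixed-length unit patterns cut from the reference at a constant shift, a city-block frame distance, and a plain continuous DP (subsequence DTW) with a free starting point. All names (`frame_distance`, `continuous_dp`, `find_similar_sections`) and the parameters `unit_len`, `shift`, and `threshold` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """City-block distance between two feature frames (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def continuous_dp(pattern: Sequence[Frame], stream: Sequence[Frame]) -> List[float]:
    """For each stream frame t, return the length-normalised cost of the best
    alignment of the whole pattern that ends at t (start point is free)."""
    n = len(pattern)
    inf = float("inf")
    prev = [inf] * n  # cumulative costs at the previous stream frame
    scores: List[float] = []
    for frame in stream:
        cur = [inf] * n
        for i in range(n):
            d = frame_distance(pattern[i], frame)
            if i == 0:
                # Free start: the pattern may begin at any stream frame.
                cur[i] = d
            else:
                # Standard DTW steps: stay, diagonal, or advance in the pattern.
                cur[i] = d + min(prev[i], prev[i - 1], cur[i - 1])
        scores.append(cur[-1] / n)
        prev = cur  # processed frame by frame, i.e. synchronously with input
    return scores


def find_similar_sections(
    reference: Sequence[Frame],
    stream: Sequence[Frame],
    unit_len: int,
    shift: int,
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Return (reference_start, stream_end_frame, score) for every unit
    pattern whose best alignment cost falls at or below the threshold."""
    hits: List[Tuple[int, int, float]] = []
    for start in range(0, len(reference) - unit_len + 1, shift):
        unit = reference[start:start + unit_len]
        for t, score in enumerate(continuous_dp(unit, stream)):
            if score <= threshold:
                hits.append((start, t, score))
    return hits
```

In this sketch, each hit marks an input frame at which some reference unit ends with a low alignment cost. Adjacent hits from consecutive units would still need to be merged into longer sections, a step this example leaves out.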
