An algorithm for similar utterance section extraction for managing spoken documents

This paper proposes a new, efficient algorithm for extracting similar sections between two time-sequence data sets. The algorithm, called Relay Continuous Dynamic Programming (Relay CDP), performs fast matching between arbitrary sections of the reference pattern and the input pattern, and extracts similar sections in a frame-synchronous manner. Relay CDP is then extended to two applications that handle spoken documents. The first is the extraction of repeated utterances from a presentation or a news broadcast. Repeated utterances are assumed to be important parts of the speech, so they can serve as labels for information retrieval. The second is flexible spoken document retrieval, in which a phonetic model is introduced to cope with speech from different speakers. The algorithm lets a user pose a query as a natural utterance and searches the spoken documents for any partial match to that utterance. We give a detailed explanation of Relay CDP and report experimental results for the extraction of similar sections and for the two applications.
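As a rough illustration of what extracting similar sections between two sequences involves, the following Python sketch uses a plain local-alignment dynamic program over the full score matrix. It is not Relay CDP, whose recurrence is not given in this abstract, and it makes no attempt at the speed or frame-synchronous operation described above. The function name `similar_section`, the use of scalar frames, the absolute-difference distance, and the `threshold` parameter are all assumptions made for illustration.

```python
def similar_section(ref, inp, threshold=0.5):
    """Find the best-matching pair of sections in two numeric sequences.

    Illustrative local alignment only (hypothetical helper, not Relay CDP).
    Returns (score, (ref_start, ref_end), (inp_start, inp_end)) with
    half-open index ranges, or (0.0, None, None) if nothing matches.
    """
    n, m = len(ref), len(inp)
    # score[i][j]: best accumulated similarity of a section pair that
    # ends at ref[i-1] and inp[j-1]; 0.0 means "no open section here".
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: (ref_start, inp_start) of the section pair ending here.
    start = [[None] * (m + 1) for _ in range(n + 1)]
    best_score, best_start, best_end = 0.0, None, None

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local similarity: positive when frames are close.
            gain = threshold - abs(ref[i - 1] - inp[j - 1])
            # Continue the best of the three DTW-style predecessors.
            prev_score, prev_start = max(
                (score[i - 1][j - 1], start[i - 1][j - 1]),
                (score[i - 1][j], start[i - 1][j]),
                (score[i][j - 1], start[i][j - 1]),
                key=lambda cand: cand[0],
            )
            if prev_score <= 0.0 or prev_start is None:
                # No section to continue, so open a new one at this cell.
                prev_score, prev_start = 0.0, (i - 1, j - 1)
            total = prev_score + gain
            if total > 0.0:
                score[i][j] = total
                start[i][j] = prev_start
                if total > best_score:
                    best_score, best_start, best_end = total, prev_start, (i, j)

    if best_start is None:
        return 0.0, None, None
    return (
        best_score,
        (best_start[0], best_end[0]),
        (best_start[1], best_end[1]),
    )


result = similar_section([9, 1, 2, 3, 8], [5, 5, 1, 2, 3, 6])
print(result)
```

In this toy input the shared run `1, 2, 3` occupies `ref[1:4]` and `inp[2:5]`. Real speech data would use feature vectors per frame and a vector distance in place of the scalar difference assumed here.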
