Efficient System Combination for Syllable-Confusion-Network-Based Chinese Spoken Term Detection

This paper examines system combination for syllable-confusion-network (SCN)-based Chinese spoken term detection (STD). System combination for STD usually improves accuracy but increases index size or complicates the index structure. This paper explores methods for efficiently combining a word-based system and a syllable-based system while keeping the index compact. First, a composite SCN is generated using one of two approaches: lattice combination (the SCN is generated from a combined lattice) or confusion network combination (two SCNs are combined into one). Then a simple, compact index is constructed from this composite SCN by merging cross-system redundant information. Experiments on a 60-hour corpus show a relative accuracy improvement of 14.7% over the baseline syllable-based system. Meanwhile, the index is 22.3% smaller than that of the commonly adopted score combination method at comparable accuracy.
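The confusion network combination and redundancy-merging steps can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the two SCNs have already been aligned slot by slot, that each slot is a mapping from syllable to posterior, and that posteriors are combined by linear interpolation with a hypothetical weight `weight_a`. All function names are illustrative.

```python
from collections import defaultdict


def combine_slots(slot_a, slot_b, weight_a=0.5):
    """Merge two aligned slots (syllable -> posterior) by linear interpolation.

    A syllable hypothesized by both systems ends up as a single entry,
    which is where the cross-system redundancy is removed.
    """
    merged = defaultdict(float)
    for syllable, posterior in slot_a.items():
        merged[syllable] += weight_a * posterior
    for syllable, posterior in slot_b.items():
        merged[syllable] += (1.0 - weight_a) * posterior
    return dict(merged)


def combine_scns(scn_a, scn_b, weight_a=0.5):
    """Combine two SCNs, assuming a one-to-one slot alignment (an assumption)."""
    if len(scn_a) != len(scn_b):
        raise ValueError("this sketch requires SCNs with the same number of slots")
    return [combine_slots(a, b, weight_a) for a, b in zip(scn_a, scn_b)]


def build_index(scn):
    """Build an inverted index: syllable -> list of (slot position, posterior)."""
    index = defaultdict(list)
    for position, slot in enumerate(scn):
        for syllable, posterior in slot.items():
            index[syllable].append((position, posterior))
    return dict(index)


# Toy SCNs: one derived from a word-based system, one from a syllable-based system.
word_scn = [{"zhong": 0.9, "zong": 0.1}, {"guo": 1.0}]
syllable_scn = [{"zhong": 0.7, "chong": 0.3}, {"guo": 0.8, "gou": 0.2}]

composite_scn = combine_scns(word_scn, syllable_scn)
composite_index = build_index(composite_scn)
```

In this toy case, indexing the two SCNs separately would store seven postings, while the composite index stores five, because "zhong" and "guo" appear in both systems and are merged into single entries.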
