Using acoustic dissimilarity measures based on state-level distance vector representation for improved spoken term detection

This paper proposes a simple approach to subword-based spoken term detection (STD) that uses improved acoustic dissimilarity measures based on a state-level distance-vector representation. The approach assumes that both the query term and the spoken documents are represented as subword units and then converted to sequences of HMM states. The set of all distributions in the subword-based HMMs is used to generate a distance-vector representation for each state of every subword unit. Each element of a distance vector is the distance between the distributions of two different states, so the vector represents a structural feature at the state level. Experimental results showed that the proposed method significantly outperforms the baseline method, which employs a conventional acoustic dissimilarity measure based on subword units, with very little increase in the required search time.
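
The state-level distance-vector idea can be illustrated with a minimal sketch. The abstract does not say which distribution distance or vector comparison is used, so the choices here are assumptions made only for illustration: each state is a single diagonal-covariance Gaussian, the distance between distributions is the Bhattacharyya distance, and two states are compared by the Euclidean distance between their distance vectors. The function names and the toy states are hypothetical.

```python
import math

def bhattacharyya(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Assumption: the abstract does not name the distribution distance.
    """
    dist = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)  # averaged variance in this dimension
        dist += 0.125 * (ma - mb) ** 2 / v
        dist += 0.5 * math.log(v / math.sqrt(va * vb))
    return dist

def distance_vectors(states):
    """Build one distance vector per state.

    states is a list of (mean, variance) pairs. Element j of vector i is the
    distance from state i to state j (0.0 when i == j).
    """
    return [[bhattacharyya(*s, *t) for t in states] for s in states]

def state_dissimilarity(vec_a, vec_b):
    """Euclidean distance between two distance vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

# Toy 2-D states: states 0 and 1 are acoustically close, state 2 is far away.
toy_states = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([0.1, 0.0], [1.0, 1.0]),
    ([3.0, 3.0], [1.0, 1.0]),
]
vectors = distance_vectors(toy_states)
close_pair = state_dissimilarity(vectors[0], vectors[1])
far_pair = state_dissimilarity(vectors[0], vectors[2])
print(close_pair < far_pair)
```

Because the distance vectors depend only on the HMM distributions, they could be precomputed once and looked up during search, which would be consistent with the small increase in search time reported above.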
