Spoken Term Detection Using Visual Spectrogram Matching

This work proposes a novel spoken term detection technique, where the query is in audio format. Detection and retrieval are performed by matching the spectrograms of the spoken document and query as visual images, using ideas from computer vision. Local descriptors are computed on a dense grid over each spectrogram, and the query term is detected using deformable template matching of grids. Detection experiments are performed on an hour-long newscast recording, involving 10 query terms of length 2-3 words. When the query term comes from the document, nearly all other instances of the term in the document are detected; performance degrades when the query is recorded by the user.

[1]  John H. L. Hansen,et al.  SPEECHFIND: spoken document retrieval for a national gallery of the spoken word , 2004, Proceedings of the 6th Nordic Signal Processing Symposium, 2004. NORSIG 2004..

[2]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[4]  Peng Yu,et al.  Vocabulary-independent search in spontaneous speech , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Matti Pietikäinen,et al.  Face Description with Local Binary Patterns: Application to Face Recognition , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  D. Klatt,et al.  On the automatic recognition of continuous speech:Implications from a spectrogram-reading experiment , 1973 .

[7]  Shi-wook Lee,et al.  Open-vocabulary spoken document retrieval based on new subword models and subword phonetic similarity , 2006, INTERSPEECH.

[8]  Bhuvana Ramabhadran,et al.  Fast audio search using vector space modelling , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[9]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[10]  Jia Liu,et al.  A study of lattice-based spoken term detection for Chinese spontaneous speech , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[11]  Peter Schäuble,et al.  A system for retrieving speech documents , 1992, SIGIR '92.

[12]  Xu Wang,et al.  A frequency warping approach for vocal tract length normalization , 2004, Proceedings 7th International Conference on Signal Processing, 2004. Proceedings. ICSP '04. 2004..

[13]  Frank K. Soong,et al.  Divergence-Based Similarity Measure for Spoken Document Retrieval , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[14]  Karen Spärck Jones,et al.  The Cambridge University spoken document retrieval system , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[15]  Aaron E. Rosenberg,et al.  Performance tradeoffs in dynamic time warping algorithms for isolated word recognition , 1980 .

[16]  Derek Hoiem,et al.  Computer vision for music identification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).