Speaker independent discriminant feature extraction for acoustic pattern-matching

Acoustic pattern-matching algorithms have recently regained prominence for automatically processing speech utterances in settings where no prior knowledge of the spoken language is available. Applications of this technology include, among others, query-by-example search, spoken term detection and automatic word discovery. A key step in these algorithms is obtaining content-aware acoustic features that are as independent as possible of speaker and acoustic-environment variations. Currently, GMM posteriorgrams are found to outperform standard MFCC features, even though they were not designed to optimize the discrimination between acoustic classes. In this paper we combine the K-means clustering algorithm with the GMM posteriorgram front-end to obtain more discriminant features. Results on a query-by-example task show that the proposed approaches outperform standard MFCC features by 7.8% absolute P@N and GMM-based posteriorgram features by 3.7% absolute P@N when using a 64-dimensional feature vector.
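As a rough illustration of the GMM posteriorgram front-end mentioned above, the following minimal sketch computes, for each feature frame, the posterior probability of every Gaussian component. It assumes a diagonal-covariance GMM whose parameters are already trained; the function names and the toy two-component model are hypothetical and are not taken from the paper, and the K-means combination proposed in the paper is not shown because the abstract does not describe it.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posteriorgram(frames, weights, means, variances):
    """Return one posterior vector per frame; each vector sums to 1."""
    result = []
    for x in frames:
        # Log of (component weight * component likelihood) for each Gaussian.
        log_joint = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(log_joint)
        exps = [math.exp(lj - peak) for lj in log_joint]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Toy model (assumed values): two components in a 2-dimensional feature space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [3.9, 4.2], [2.0, 2.0]]

post = posteriorgram(frames, weights, means, variances)
```

In this toy setup the first frame is assigned almost entirely to the first component, the second frame to the second component, and the third frame, which is equidistant from both means, is split evenly. In a 64-dimensional front-end such as the one evaluated here, the model would presumably have 64 components, giving one 64-dimensional posterior vector per frame.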
