Audio context recognition in variable mobile environments from short segments using speaker and language recognizers

The problem of context recognition from mobile audio data is considered. We consider ten different audio contexts (such as car, bus, office and outdoors) prevalent in daily-life situations. We choose mel-frequency cepstral coefficient (MFCC) parametrization and present an extensive comparison of six different classifiers: k-nearest neighbor (kNN), vector quantization (VQ), Gaussian mixture models trained with both maximum likelihood (GMM-ML) and maximum mutual information (GMM-MMI) criteria, GMM supervector support vector machine (GMM-SVM) and, finally, SVM with generalized linear discriminant sequence (GLDS-SVM). After all parameter optimizations, the GMM-MMI and VQ classifiers perform best, with context identification rates of 52.01 % and 50.34 %, respectively, using 3-second data records. Our analysis further reveals that none of the six classifiers is consistently superior to the others when class-, user- or phone-specific accuracies are considered.
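To make the VQ classifier in this comparison concrete, here is a minimal sketch of VQ-based context identification in the style of VQ speaker recognition. It is not the authors' implementation, and several assumptions are ours: MFCC frames are taken as already extracted and passed in as lists of floats; each context's codebook is trained with plain k-means (Lloyd iterations) from a random initialization; a test segment (for example, the frames of one 3-second record) is scored by its average squared-Euclidean quantization distortion and assigned to the context with the lowest score. The function names `train_codebook`, `avg_distortion` and `classify` are hypothetical.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(frames: List[Vector], size: int,
                   iters: int = 20, seed: int = 0) -> List[List[float]]:
    """Fit one context's VQ codebook with plain k-means.

    Assumption: codewords are initialized by sampling training frames;
    the paper's exact codebook training procedure is not reproduced here.
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(frames, size)]
    for _ in range(iters):
        # Assignment step: group frames by their nearest codeword.
        clusters: List[List[Vector]] = [[] for _ in range(size)]
        for v in frames:
            j = min(range(size), key=lambda k: sq_dist(v, codebook[k]))
            clusters[j].append(v)
        # Update step: move each codeword to the mean of its cluster.
        for k, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                dim = len(members[0])
                codebook[k] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


def avg_distortion(frames: List[Vector], codebook: List[List[float]]) -> float:
    """Average quantization distortion of a segment against one codebook."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in frames) / len(frames)


def classify(segment: List[Vector],
             codebooks: Dict[str, List[List[float]]]) -> str:
    """Return the context label whose codebook quantizes the segment best."""
    return min(codebooks, key=lambda label: avg_distortion(segment, codebooks[label]))
```

In use, one would call `train_codebook` once per context (e.g. on all training frames labeled "car"), collect the results in a dict keyed by context name, and pass each test segment's frames to `classify`. Averaging distortion over all frames in the segment, rather than voting per frame, is one common way to turn frame-level VQ scores into a segment-level decision.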