Calibrating DNN Posterior Probability Estimates of HMM/DNN Models to Improve Social Signal Detection from Audio Data

To detect social signals such as laughter or filler events in audio data, a straightforward choice is to combine a Hidden Markov Model (HMM) with a Deep Neural Network (DNN) that supplies the local class posterior estimates (an HMM/DNN hybrid model). However, the posterior estimates of the DNN may be suboptimal because of a mismatch between the cost function used during training (e.g. frame-level cross-entropy) and the actual evaluation metric (e.g. segment-level F1 score). In this study, we show experimentally that applying a simple posterior probability calibration technique to the DNN outputs can significantly improve the performance of the HMM/DNN workflow. Specifically, we apply a linear transformation to the activations of the output layer right before the softmax function, and fine-tune the parameters of this transformation. Of the calibration approaches tested, we obtained the best F1 scores when the posterior calibration process was tuned to maximize the actual HMM-based evaluation metric.
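The calibration step described above can be illustrated with a minimal sketch. It assumes a per-class (diagonal) scale and bias applied to the output-layer activations before the softmax, and a plain random search over those parameters; both are illustrative choices, since the abstract only states that a linear transformation is fine-tuned. The names `calibrated_posteriors`, `random_search` and `score_fn` are hypothetical, and `score_fn` stands in for whatever metric is being maximized (in the paper's best setup, the HMM-based segment-level F1 score, which is not implemented here).

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def calibrated_posteriors(logits, scale, bias):
    """Apply a per-class linear transformation to the output-layer
    activations (logits), then the softmax.

    Assumption: the transformation is diagonal (one scale and one bias
    per class); a full matrix would be the more general form.
    """
    return softmax(logits * scale + bias)


def random_search(logits, score_fn, n_trials=200, step=0.1, seed=0):
    """Tune scale and bias by simple random search (an assumed optimizer).

    score_fn maps a (frames x classes) posterior matrix to a number,
    where higher is better.
    """
    rng = np.random.default_rng(seed)
    n_classes = logits.shape[1]
    # Start from the identity transformation, i.e. the uncalibrated DNN.
    best_scale = np.ones(n_classes)
    best_bias = np.zeros(n_classes)
    best_score = score_fn(calibrated_posteriors(logits, best_scale, best_bias))
    for _ in range(n_trials):
        # Perturb the current best parameters; scales stay positive.
        scale = best_scale * np.exp(rng.normal(0.0, step, n_classes))
        bias = best_bias + rng.normal(0.0, step, n_classes)
        score = score_fn(calibrated_posteriors(logits, scale, bias))
        if score > best_score:
            best_scale, best_bias, best_score = scale, bias, score
    return best_scale, best_bias, best_score
```

Because the search starts from the identity transformation and only accepts improvements, the returned score on the tuning data is never lower than that of the uncalibrated posteriors. Whether the gain carries over to unseen data depends on the metric and on the size of the tuning set.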
