Social Signal Detection by Probabilistic Sampling DNN Training

When the task is to detect social signals such as laughter and filler events in an audio recording, the most straightforward approach is to apply a Hidden Markov Model, or a Hidden Markov Model/Deep Neural Network (HMM/DNN) hybrid, which is currently considered state of the art. In this hybrid model, the DNN component is trained on frame-level samples of the classes to be detected. In such event detection tasks, however, the training labels are severely imbalanced: typically only a small fraction of the training data corresponds to social signals, while the bulk of the utterances consists of speech segments or silence. A strong class imbalance is known to cause difficulties during DNN training. To alleviate these problems, we apply a technique called probabilistic sampling, which seeks to balance the class distribution. Probabilistic sampling is a mathematically well-founded combination of upsampling and downsampling, and it has been found to outperform both of these simple resampling approaches. With this strategy, we achieved a 7–8 percent relative error reduction at both the segment level and the frame level, and we also reduced the DNN training times.
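To make the idea of probabilistic sampling concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the formulation usually given for this technique: a class is first drawn with probability λ/K + (1 − λ)·Prior(c), where K is the number of classes, and a training frame is then drawn uniformly from that class. The abstract itself does not state this formula, so treat it as an assumption. The function names, the `lam` parameter, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def class_sampling_probs(labels, lam):
    """Interpolate between the empirical class priors (lam = 0)
    and a uniform class distribution (lam = 1)."""
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    n, k = len(labels), len(counts)
    return {c: lam / k + (1.0 - lam) * counts[c] / n for c in counts}


def probabilistic_sample(labels, lam, num_draws, seed=0):
    """Two-step sampling: pick a class, then pick a frame of that class.
    Returns indices into `labels`; rare classes are effectively upsampled
    and frequent classes downsampled as lam grows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    probs = class_sampling_probs(labels, lam)
    classes = sorted(probs)
    weights = [probs[c] for c in classes]
    drawn = []
    for _ in range(num_draws):
        c = rng.choices(classes, weights=weights, k=1)[0]
        drawn.append(rng.choice(by_class[c]))
    return drawn


# Toy frame-level labels (hypothetical): 90% speech, 5% laughter, 5% filler.
labels = ["speech"] * 90 + ["laughter"] * 5 + ["filler"] * 5
balanced = probabilistic_sample(labels, lam=1.0, num_draws=3000)
```

With `lam=1.0` each class is selected about equally often, so the DNN sees roughly as many laughter and filler frames as speech frames; with `lam=0.0` the sampler reproduces the original imbalanced distribution. Intermediate values trade off between the two.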
