LSTM-Based Whisper Detection

This article presents a whispered-speech detector for the far-field domain. The proposed system consists of a long short-term memory (LSTM) neural network trained on log-filterbank energy (LFBE) acoustic features. The model is trained and evaluated on recordings of human interactions with voice-controlled, far-field devices in whispered and normal phonation modes. We compare multiple inference approaches for utterance-level classification by examining trajectories of the LSTM posteriors. In addition, we engineer a set of features based on the signal characteristics inherent to whispered speech and evaluate their effectiveness in further separating whispered from normal speech. Benchmarking these features with multilayer perceptrons (MLPs) and LSTMs suggests that the proposed features, in combination with LFBE features, can further improve our classifiers. We show that, with enough data, an LSTM model using LFBE features alone learns whisper characteristics as well as a simpler MLP model that uses both LFBE features and the features engineered for separating whispered and normal speech. We also show that the LSTM classifier's accuracy can be further improved by incorporating the proposed engineered features.
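As an illustration of what utterance-level inference over a posterior trajectory can look like, the following minimal sketch collapses frame-level whisper posteriors into a single utterance-level decision. The three strategies shown (mean over all frames, last frame only, mean over the final frames), the function names, the 0.5 threshold, and the `tail` parameter are assumptions made for this example; the abstract does not specify which inference approaches were compared.

```python
from statistics import mean


def utterance_score(posteriors, strategy="mean", tail=10):
    """Collapse frame-level whisper posteriors into one utterance-level score.

    posteriors: per-frame P(whisper) values in [0, 1], in time order.
    strategy:   hypothetical pooling rule, chosen for illustration only.
    tail:       number of final frames used by the "tail_mean" strategy.
    """
    if not posteriors:
        raise ValueError("need at least one frame posterior")
    if strategy == "mean":
        # Average over the whole trajectory.
        return mean(posteriors)
    if strategy == "last":
        # Trust only the final frame, where the LSTM has seen the most context.
        return posteriors[-1]
    if strategy == "tail_mean":
        # Average over the last `tail` frames (or fewer if the utterance is short).
        return mean(posteriors[-tail:])
    raise ValueError(f"unknown strategy: {strategy}")


def is_whisper(posteriors, strategy="mean", threshold=0.5, tail=10):
    """Return True if the pooled score reaches the (assumed) decision threshold."""
    return utterance_score(posteriors, strategy, tail) >= threshold


if __name__ == "__main__":
    # A made-up trajectory in which confidence grows as more frames are seen.
    trajectory = [0.2, 0.4, 0.6, 0.8, 0.9]
    for name in ("mean", "last", "tail_mean"):
        print(name, utterance_score(trajectory, name, tail=2))
```

The strategies differ in how much weight they give to early frames, where a recurrent model has accumulated little context; which rule works best is an empirical question.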
