Formant Tracking Using Quasi-Closed Phase Forward-Backward Linear Prediction Analysis and Deep Neural Networks

This study investigates formant tracking using trackers based on dynamic programming (DP) and deep neural networks (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms, and the recently developed quasi-closed phase forward-backward (QCP-FB) method. QCP-FB gave the best performance in the comparison. Therefore, a novel formant tracking approach, which combines the benefits of deep learning and of signal processing based on QCP-FB, was proposed. In this approach, the formants predicted by a DNN-based tracker from a speech frame are refined using the peaks of the all-pole spectrum computed by QCP-FB from the same frame. Results show that the proposed DNN-based tracker performed better than the reference formant trackers in both detection rate and estimation error for the lowest three formants. Compared to the popular Wavesurfer tool, for example, the proposed tracker reduced the estimation error for the lowest three formants by 29%, 48%, and 35%, respectively.
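
The refinement step described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: QCP-FB itself is not implemented, so a synthetic all-pole polynomial built by the hypothetical helper `poly_from_resonances` stands in for its output; the "DNN predictions" are hand-picked numbers; and the nearest-peak rule with a `max_shift_hz` threshold is only one plausible way to refine predictions with spectral peaks.

```python
import numpy as np


def poly_from_resonances(freqs_hz, bws_hz, fs):
    """Build an all-pole denominator A(z) from resonance frequencies and bandwidths.

    Synthetic stand-in for the polynomial an LP-type analysis (e.g. QCP-FB)
    would return for a frame; it is NOT an implementation of QCP-FB.
    """
    a = np.array([1.0])
    for f, bw in zip(freqs_hz, bws_hz):
        r = np.exp(-np.pi * bw / fs)        # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs        # pole angle from frequency
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a


def allpole_peak_freqs(a, fs, n_fft=4096):
    """Return the frequencies (Hz) of local maxima of the spectrum 1/|A(e^jw)|."""
    A = np.fft.rfft(a, n_fft)               # zero-padded evaluation of A on the unit circle
    mag = 1.0 / np.maximum(np.abs(A), 1e-12)
    inner = mag[1:-1]
    idx = np.flatnonzero((inner > mag[:-2]) & (inner >= mag[2:])) + 1
    return idx * fs / n_fft


def refine_formants(predicted_hz, peak_hz, max_shift_hz=300.0):
    """Move each predicted formant to its nearest spectral peak.

    A prediction is left unchanged when no peak lies within max_shift_hz
    (an assumed threshold, chosen only for illustration).
    """
    refined = np.asarray(predicted_hz, dtype=float).copy()
    peaks = np.asarray(peak_hz, dtype=float)
    if peaks.size == 0:
        return refined
    for i, f in enumerate(refined):
        j = np.argmin(np.abs(peaks - f))
        if abs(peaks[j] - f) <= max_shift_hz:
            refined[i] = peaks[j]
    return refined


if __name__ == "__main__":
    fs = 8000
    a = poly_from_resonances([500.0, 1500.0, 2500.0], [60.0, 80.0, 100.0], fs)
    peaks = allpole_peak_freqs(a, fs)
    dnn_prediction = [450.0, 1600.0, 2400.0]   # hypothetical tracker output
    print(refine_formants(dnn_prediction, peaks))
```

In this sketch the predictions are pulled onto the peaks of the all-pole spectrum, while a prediction with no nearby peak is kept as is, so a missing spectral peak does not discard the tracker's estimate.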
