Paper Information - GCI Detection from Raw Speech Using a Fully-Convolutional Network

GCI Detection from Raw Speech Using a Fully-Convolutional Network

Glottal Closure Instant (GCI) detection consists of automatically locating, from the speech signal, the instants of most significant excitation of the vocal tract. It is used in many speech analysis and processing applications, and various algorithms have been proposed for this purpose. Recently, new approaches using convolutional neural networks have emerged, with encouraging results. Following this trend, we propose a simple approach that maps the speech waveform to a target signal from which the GCIs are obtained by peak-picking. However, the ground truth GCIs used for training and evaluation are usually extracted from electroglottographic (EGG) signals, which are not perfectly reliable and are often unavailable. To overcome this problem, we propose to train our network on high-quality synthetic speech with perfect ground truth. The performance of the proposed algorithm is compared with that of three other state-of-the-art approaches using publicly available datasets, and the impact of using controlled synthetic or real speech signals in the training stage is investigated. The experimental results demonstrate that the proposed method obtains similar or better results than other state-of-the-art algorithms, and that using large synthetic datasets with many speakers offers better generalization than using a smaller database of real speech and EGG signals.
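The peak-picking step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the shape of the target signal, the threshold, or the minimum spacing between peaks, so the function name `pick_peaks`, its parameters, and the Gaussian-bump toy target below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def pick_peaks(target, sample_rate, threshold=0.5, min_distance_s=0.002):
    """Return times (in seconds) of local maxima of `target`.

    Hypothetical helper: `threshold` and `min_distance_s` are assumed
    values, not taken from the paper.
    """
    min_gap = max(1, int(round(min_distance_s * sample_rate)))
    # Candidate peaks: local maxima whose height reaches the threshold.
    candidates = [
        n for n in range(1, len(target) - 1)
        if target[n] >= threshold
        and target[n] > target[n - 1]
        and target[n] >= target[n + 1]
    ]
    # Greedy selection: accept the highest peaks first and reject any
    # candidate closer than min_gap samples to an accepted peak.
    kept = []
    for n in sorted(candidates, key=lambda i: target[i], reverse=True):
        if all(abs(n - m) >= min_gap for m in kept):
            kept.append(n)
    return [n / sample_rate for n in sorted(kept)]


# Toy target signal (assumed shape): narrow bumps centred on three
# known "GCI" positions, 160 samples apart at 16 kHz (a 100 Hz voice).
sample_rate = 16000
true_gcis = [100, 260, 420]
target = [
    sum(math.exp(-0.5 * ((n - c) / 4.0) ** 2) for c in true_gcis)
    for n in range(520)
]

gci_times = pick_peaks(target, sample_rate)
gci_samples = [round(t * sample_rate) for t in gci_times]
print(gci_samples)
```

In a real system the `target` list would be the network output for a speech segment. The minimum-distance constraint is one common way to suppress spurious double detections within a single glottal cycle.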

Axel Roebel | Luc Ardaillon

[1] Thierry Dutoit, et al. The Deterministic Plus Stochastic Model of the Residual Signal and Its Applications, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[2] Patrick A. Naylor, et al. Estimation of Glottal Closing and Opening Instants in Voiced Speech Using the YAGA Algorithm, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[3] Varun Srivastava, et al. Detection of Glottal Closure Instants from Raw Speech Using Convolutional Neural Networks, 2019, INTERSPEECH.

[4] Patrick A. Naylor, et al. The SIGMA Algorithm: A Glottal Activity Detector for Electroglottographic Signals, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[5] Axel Röbel, et al. On glottal source shape parameter transformation using a novel deterministic and stochastic speech analysis and synthesis system, 2015, INTERSPEECH.

[6] K. Sreenivasa Rao, et al. Glottal Closure Instants Detection From Pathological Acoustic Speech Signal Using Deep Learning, 2018, ArXiv.

[7] Axel Roebel, et al. Fully-Convolutional Network for Pitch Estimation of Speech Signals, 2019, INTERSPEECH.

[8] Alan W. Black, et al. The CMU Arctic speech databases, 2004, SSW.

[9] Daniel Tihelka, et al. Classification-Based Detection of Glottal Closure Instants from Speech Signals, 2017, INTERSPEECH.

[10] Maxine Eskénazi, et al. Design considerations and text selection for BREF, a large French read-speech corpus, 1990, ICSLP.

[11] M. Sabarimalai Manikandan, et al. Effective Glottal Instant Detection and Electroglottographic Parameter Extraction for Automated Voice Pathology Assessment, 2018, IEEE Journal of Biomedical and Health Informatics.

[12] Abeer Alwan, et al. Glottal source processing: From analysis to applications, 2014, Comput. Speech Lang.

[13] Zhiyong Wu, et al. Detection of Glottal Closure Instants from Speech Signals: A Convolutional Neural Network Based Method, 2018, INTERSPEECH.

[14] J. Liljencrants, et al. A Four-parameter Model of Glottal Flow, 2022, Dept. for Speech, Music and Hearing Quarterly Progress and Status Report.

[15] John Kane, et al. COVAREP — A collaborative voice analysis repository for speech technologies, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Luc Ardaillon. Synthesis and expressive transformation of singing voice, 2017.

[17] Bayya Yegnanarayana, et al. Extracting formants from short segments of speech using group delay functions, 2006, INTERSPEECH.

[18] Bayya Yegnanarayana, et al. Epoch Extraction From Speech Signals, 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[19] Patrick A. Naylor, et al. Detection of Glottal Closure Instants From Speech Signals: A Quantitative Review, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[20] Mike Brookes, et al. Estimation of Glottal Closure Instants in Voiced Speech Using the DYPSA Algorithm, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[21] A. G. Ramakrishnan, et al. Epoch Extraction Based on Integrated Linear Prediction Residual Using Plosion Index, 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[22] Franz Pernkopf, et al. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario, 2011, INTERSPEECH.

[23] Khalid Daoudi, et al. Detection of Glottal Closure Instants Based on the Microcanonical Multiscale Formalism, 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[24] Thierry Dutoit, et al. A quantitative comparison of glottal closure instant estimation algorithms on a large variety of singing sounds, 2013, INTERSPEECH.

[25] Thierry Dutoit, et al. Glottal closure and opening instant detection from speech signals, 2019, INTERSPEECH.

[26] Jong Wook Kim, et al. CREPE: A Convolutional Representation for Pitch Estimation, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27] Bayya Yegnanarayana, et al. Event-Based Instantaneous Fundamental Frequency Estimation From Speech Signals, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[28] Mayank Mishra, et al. Adversarial Approximate Inference for Speech to Electroglottograph Conversion, 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[29] Victor Zue, et al. Speech database development at MIT: TIMIT and beyond, 1990, Speech Commun.

[30] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[31] Tomi Kinnunen, et al. Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks, 2018, INTERSPEECH.

[32] Axel Röbel, et al. Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis, 2013, Speech Commun.