Glottal Flow Synthesis for Whisper-to-Speech Conversion

Whisper-to-speech conversion is motivated by laryngeal disorders, in which malfunction of the vocal folds leads to loss of voicing. Many patients with such disorders can still produce functional whispers, because whispering does not require vocal fold vibration. Whispers therefore constitute a common ground for speech rehabilitation across many kinds of laryngeal disorder. Whisper-to-speech conversion recreates natural-sounding speech from recorded whispers. It is a non-invasive, non-surgical form of rehabilitation that, unlike existing rehabilitation methods, can maintain a natural way of speaking. This article proposes a new rule-based method for whisper-to-speech conversion that replaces the noisy whisper sound source with a synthesised speech-like harmonic source, while leaving the vocal tract component unaltered. In particular, a novel glottal source generator is developed in which whisper information is used to parameterise the excitation through a high-quality glottis model. Evaluation of the system against the standard pulse train excitation method reveals significantly improved performance. Since our method is glottis-based, it is potentially compatible with the many existing vocal tract component adaptation systems.
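The source-replacement idea described above can be illustrated with a minimal source-filter sketch. This is not the glottal source generator proposed in the article: the Rosenberg-style pulse is a simple stand-in for the glottis model, the fixed `f0` replaces the whisper-derived parameterisation, and all function names, the LPC order, and the default values are assumptions chosen for illustration. The sketch estimates an all-pole vocal tract filter from a whisper frame by linear prediction, keeps that filter unchanged, and drives it with a periodic glottal-flow derivative instead of noise.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method linear prediction; returns [1, a1, ..., ap]."""
    w = frame * np.hamming(len(frame))
    # Autocorrelation at lags 0..order
    r = np.correlate(w, w, mode="full")[len(w) - 1 : len(w) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # light regularisation (assumption)
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def rosenberg_pulse(period, open_quotient=0.6, speed_quotient=2.0):
    """One period of a Rosenberg-style glottal flow pulse (illustrative stand-in)."""
    n_open = int(round(period * open_quotient))
    n_rise = int(round(n_open * speed_quotient / (1 + speed_quotient)))
    n_fall = n_open - n_rise
    pulse = np.zeros(period)
    # Opening phase: raised-cosine rise; closing phase: quarter-cosine fall
    pulse[:n_rise] = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
    pulse[n_rise:n_open] = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
    return pulse

def harmonic_source(n_samples, f0, fs):
    """Periodic excitation: repeated glottal pulses, differentiated once."""
    period = int(round(fs / f0))
    flow = np.tile(rosenberg_pulse(period), n_samples // period + 1)[:n_samples]
    # First difference approximates the lip-radiation effect
    return np.diff(flow, prepend=0.0)

def all_pole_filter(excitation, a):
    """Direct-form all-pole filter: y[n] = x[n] - sum_k a[k] * y[n-k]."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def convert_frame(whisper_frame, f0=120.0, fs=16000, order=16):
    """Keep the whisper's vocal tract filter, swap the noise source for a harmonic one."""
    a = lpc(whisper_frame, order)
    source = harmonic_source(len(whisper_frame), f0, fs)
    return all_pole_filter(source, a)
```

A usage example would be `voiced = convert_frame(whisper_frame)` on a 40 ms frame sampled at 16 kHz. A practical system would process overlapping frames, vary `f0` and the pulse shape over time, and handle unvoiced segments separately, none of which is attempted here.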
