Speech analysis and synthesis with a refined adaptive sinusoidal representation

This paper reviews common speech signal representations, with a brief description of their corresponding analysis–synthesis stages. The main focus is on adaptive sinusoidal representations, for which a refined model of speech is proposed, referred to as the Refined adaptive Sinusoidal Representation (R_aSR). Building on the performance of recently proposed adaptive Sinusoidal Models of speech, significant refinements are introduced at both the analysis and adaptive stages. First, a quasi-harmonic representation of speech is used in the analysis stage to obtain an initial estimate of the instantaneous model parameters. Next, in the adaptive stage, an adaptive scheme combined with an iterative frequency-correction mechanism allows robust estimation of the model parameters (amplitudes, frequencies, and phases). Finally, the speech signal is reconstructed as the sum of its estimated time-varying instantaneous components after an interpolation scheme. Objective evaluation shows that the proposed R_aSR achieves higher-quality reconstruction of voiced speech than state-of-the-art models, and listening tests indicate that it attains transparent perceived quality.
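The analysis stage described above can be illustrated with a minimal sketch of quasi-harmonic parameter estimation. This is not the paper's implementation; it is a generic single-frame quasi-harmonic model fit, assuming a complex (analytic) frame, a known pitch estimate `f0`, and `K` harmonics. The frame is modeled as a sum of terms `(a_k + t*b_k) * exp(j*2*pi*f_k*t)`, solved by least squares, and the slope coefficients `b_k` yield a frequency-mismatch correction of the kind an iterative frequency-correction mechanism would apply:

```python
import numpy as np

def qhm_frame(frame, t, f0, K):
    """Fit one quasi-harmonic frame and return corrected frequencies.

    frame : complex analytic signal samples of one analysis window
    t     : time axis of the window (seconds), centered at 0
    f0    : initial pitch estimate (Hz); K : number of harmonics
    """
    freqs = f0 * np.arange(1, K + 1)
    # Quasi-harmonic basis: each harmonic and its time-weighted copy
    E = np.exp(2j * np.pi * np.outer(t, freqs))   # shape (N, K)
    B = np.hstack([E, t[:, None] * E])            # shape (N, 2K)
    coef, *_ = np.linalg.lstsq(B, frame, rcond=None)
    a, b = coef[:K], coef[K:]
    # Frequency-mismatch estimate from the complex slope coefficients
    eta = (a.real * b.imag - a.imag * b.real) / (2 * np.pi * np.abs(a) ** 2)
    return a, freqs + eta

# Toy usage: a single 205 Hz component analyzed with a 200 Hz pitch guess;
# the correction recovers the true frequency from the slope term.
fs, N = 16000, 320
t = (np.arange(N) - N // 2) / fs
frame = np.exp(2j * np.pi * 205.0 * t)
amps, f_hat = qhm_frame(frame, t, f0=200.0, K=1)
```

Iterating this correction (re-fitting with the updated frequencies) and then interpolating the per-frame amplitudes, frequencies, and phases across frames gives a sum-of-sinusoids resynthesis in the spirit of the adaptive stage described above.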
