Pitch-synchronous processing of speech signal for improving the quality of low bit rate speech coders
Over the past decade, advances in low bit rate speech coding have produced a family of speech coders with very good quality and high intelligibility. However, these state-of-the-art coders still have difficulty modeling and encoding the transition regions of the speech signal. This problem not only degrades the overall quality of these coders but also prevents further quality improvement as the bit rate increases. One reason for this failure is the stationarity assumption used when estimating the model parameters in these regions. In these cases the parameter estimates are often flawed, and using them in synthesis may result in audible artifacts.
This thesis introduces new methods that efficiently capture the spectral characteristics of the shortest possible speech segments: individual pitch cycles. As a result, the perceptually important information can be captured in both stationary and transition regions. To this end, the thesis presents a new class of linear-prediction methods and a residual signal representation method, both of which operate on a single pitch cycle. A new 2.4 kb/s speech coder is also developed using the proposed methods. Listening tests showed that this new coder performs better than the current state-of-the-art speech coder, especially for female speech.
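To make the idea of linear prediction over a single pitch cycle concrete, the following is a minimal sketch only. The abstract does not describe the thesis's new class of methods, so this uses the textbook covariance (least-squares) formulation, which needs no analysis window and can therefore be applied to a segment as short as one pitch cycle. The function name `lpc_covariance`, the 80-sample cycle length, and the toy second-order signal are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def lpc_covariance(segment, order):
    """Covariance-method linear prediction over one short segment.

    Returns coefficients a such that s[n] is approximated by
    sum_k a[k] * s[n - 1 - k] for n = order .. len(segment) - 1.
    Textbook formulation; NOT the method proposed in the thesis.
    """
    s = np.asarray(segment, dtype=float)
    n = len(s)
    if n <= 2 * order:
        raise ValueError("segment too short for the requested order")
    # Column k holds the samples delayed by k + 1 relative to the target.
    X = np.column_stack([s[order - 1 - k : n - 1 - k] for k in range(order)])
    y = s[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

# Toy "pitch cycle" (assumed 80 samples, e.g. 100 Hz pitch at 8 kHz):
# impulse response of s[n] = 1.5 s[n-1] - 0.9 s[n-2].
cycle = np.zeros(80)
cycle[0], cycle[1] = 1.0, 1.5
for i in range(2, len(cycle)):
    cycle[i] = 1.5 * cycle[i - 1] - 0.9 * cycle[i - 2]

coeffs = lpc_covariance(cycle, order=2)
residual = cycle[2:] - (coeffs[0] * cycle[1:-1] + coeffs[1] * cycle[:-2])
```

Because the segment is not windowed, the estimate depends only on samples inside the cycle, which is the property that makes cycle-by-cycle analysis possible in principle. The residual computed here is the plain prediction error, not the residual representation developed in the thesis.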
In addition, the quality of current parametric speech coders is improved by encoding the waveform of the excitation signal in transition regions. For this purpose, a new algorithm is introduced that modifies the original signal so that it becomes time-synchronous with the synthetic signal and the waveforms of the two signals become similar. This algorithm allows fully parametric representation and waveform encoding of the excitation signal to be used in the same coder to encode different parts of the speech signal. Listening tests showed that speech coders using this method have better synthetic speech quality.
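The abstract does not say how the original signal is brought into time synchrony with the synthetic one. As an illustration of the general notion only, the sketch below searches for the single time shift that maximizes the normalized cross-correlation between the two signals. The function name `best_lag`, the search range, and the test signals are assumptions for this example; the thesis algorithm also makes the waveforms similar, which this sketch does not attempt.

```python
import numpy as np

def best_lag(original, synthetic, max_lag):
    """Return the lag d in [-max_lag, max_lag] that maximizes the
    normalized correlation between original[n + d] and synthetic[n].

    Simple exhaustive search; an illustration, not the thesis algorithm.
    """
    x = np.asarray(original, dtype=float)
    y = np.asarray(synthetic, dtype=float)
    best, best_score = 0, -np.inf
    for d in range(-max_lag, max_lag + 1):
        # Range of n where both original[n + d] and synthetic[n] exist.
        lo = max(0, -d)
        hi = min(len(y), len(x) - d)
        if hi - lo < 2:
            continue
        xs = x[lo + d : hi + d]
        ys = y[lo:hi]
        denom = np.sqrt(np.dot(xs, xs) * np.dot(ys, ys))
        if denom == 0.0:
            continue
        score = np.dot(xs, ys) / denom
        if score > best_score:
            best, best_score = d, score
    return best

# Hypothetical test signals: the "original" is the "synthetic" signal
# delayed by 5 samples.
rng = np.random.default_rng(0)
synthetic = rng.standard_normal(200)
original = np.concatenate([np.zeros(5), synthetic])[:200]

lag = best_lag(original, synthetic, max_lag=20)
aligned = original[lag:]  # original shifted to line up with synthetic
```

A single global shift is the simplest possible form of alignment. A practical coder would more likely align short segments independently, since the offset between the original and a parametrically synthesized signal generally drifts over time.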