Speech analysis and quality enhancement using higher order cumulants

This thesis presents robust methods for speech analysis and enhancement based on newly established properties of the higher order cumulants (HOC) of speech signals. In the exploratory part of this work, it is shown that the HOC of speech are non-zero and may be expressed in terms of speech parameters such as energy, harmonic amplitudes and frequencies. These properties are established by assuming a sinusoidal model for speech and considering two specific domains, namely a subband representation and the linear predictive coding (LPC) residual. The bias and variance of the HOC estimators are examined and, for the case of a sinusoid in white Gaussian noise, quantified in terms of the process variance. An algorithm is proposed for computing the 3rd-order cumulant with a reduced number of multiplications, and a scheduling algorithm is proposed to map a set of DSP operations onto a configurable multi-unit architecture. General properties relating 2nd-order and higher order statistics (HOS) in the frequency domain are derived, such as the recovery of the Fourier magnitude spectrum from the bispectrum. The application part of this work exploits the HOC properties thus established, together with the limitations identified, to build two algorithms: the first for quality enhancement and the second for voice activity detection. The algorithm for speech enhancement uses subband-domain optimal filters based on a minimum mean square error (MMSE) criterion to recover the speech signal from the noisy observation. The key idea is to use the 4th-order cumulant of the noisy speech to estimate the parameters required for the filters, namely the 2nd-order statistics of the speech and noise as well as the probability of speech presence. The algorithm proposed for voice activity detection (VAD) combines HOS metrics and SNR measures to classify frames as speech or noise, using the LPC residual.
A voicing condition for speech frames is derived based on the relation between the skewness and kurtosis of voiced speech. In addition, the variance of the HOS estimators is used to yield a likelihood measure for noise frames. The two algorithms developed demonstrate that, in spite of the practical limitations of using these cumulants and the approximate nature of the speech model assumed, effective application of HOC is possible. By making use of only HOC measures, these algorithms are shown to perform comparably to, and in some respects better than, the current standards. As a first iteration of this type of approach, this work demonstrates the promising potential of HOC to yield algorithms that would surpass the current state of the art.
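
The property that motivates the use of HOC in Gaussian noise can be illustrated with a small numerical sketch. It is not the thesis's estimator or either of its algorithms: it assumes a single real sinusoid as a crude stand-in for the sinusoidal speech model, unit-variance white Gaussian noise, and plain zero-lag sample cumulants. The function name, signal length, frequency and amplitude are all hypothetical choices made for illustration. For a sinusoid of amplitude A, the zero-lag 4th-order cumulant is -3A^4/8, whereas cumulants of order higher than two vanish for a Gaussian process, so the 4th-order cumulant of the noisy signal stays close to that of the clean sinusoid.

```python
import numpy as np


def zero_lag_cumulants(x):
    """Sample 2nd-, 3rd- and 4th-order cumulants at zero lag.

    Illustrative helper (not from the thesis): the signal is centred,
    then c2 = m2, c3 = m3 and c4 = m4 - 3*m2**2 are formed from the
    sample moments m2, m3, m4.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    m2 = np.mean(x ** 2)
    m3 = np.mean(x ** 3)
    m4 = np.mean(x ** 4)
    return m2, m3, m4 - 3.0 * m2 ** 2


# Assumed test signals: one sinusoid (amplitude 1, an integer number of
# periods) and unit-variance white Gaussian noise with a fixed seed.
rng = np.random.default_rng(0)
n = np.arange(200_000)
amplitude = 1.0
sinusoid = amplitude * np.cos(0.05 * np.pi * n + 0.3)
noise = rng.normal(0.0, 1.0, n.size)

c2_s, c3_s, c4_s = zero_lag_cumulants(sinusoid)
c2_n, c3_n, c4_n = zero_lag_cumulants(noise)
c2_x, c3_x, c4_x = zero_lag_cumulants(sinusoid + noise)

print("sinusoid      c2, c3, c4:", c2_s, c3_s, c4_s)
print("noise         c2, c3, c4:", c2_n, c3_n, c4_n)
print("noisy signal  c2, c3, c4:", c2_x, c3_x, c4_x)
```

The 2nd-order statistic of the noisy signal is the sum of the signal and noise powers, while the 4th-order cumulant is, up to estimation error, unaffected by the Gaussian noise. That estimation error is the practical limitation: the variance of the sample cumulants shrinks only slowly with the number of samples, which is why the sketch uses a long record.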