Statistical properties of the warped discrete cosine transform cepstrum compared with MFCC

Abstract

In this paper, we continue our investigation of the warped discrete cosine transform cepstrum (WDCTC), which was earlier introduced as a new speech processing feature [1]. Here, we study the statistical properties of the WDCTC and compare them with those of the mel-frequency cepstral coefficients (MFCC). We report several properties in which the WDCTC compares favourably with the MFCC: its statistical distribution is more Gaussian-like with lower variance, it achieves better vowel cluster separability, it forms tighter vowel clusters, and it generates better codebooks. Further, we employ the WDCTC and MFCC features in a 5-vowel recognition task using vector quantization (VQ) and 1-nearest neighbour (1-NN) classifiers. In our experiments, the WDCTC consistently outperforms the MFCC.

1. Introduction

We recently introduced the warped discrete cosine transform cepstrum (WDCTC) as a new speech processing feature and demonstrated that it performs better than the mel-frequency cepstral coefficients (MFCC) in vowel recognition and speaker identification tasks [1]. The WDCTC has shown good promise as a speech processing feature, which encourages us to investigate the feature and its statistical properties further.

A large volume of training data is required to build speaker-independent speech recognition systems. One technique for reducing the data size is to cluster the data and choose a reasonable number of representative feature vectors to form codebooks [2]. Hence, codebook techniques are highly relevant and practical for speech recognition systems. We form WDCTC and MFCC codebooks using a k-means clustering algorithm and compare the codebook statistics for clean and noisy vowels using the coefficient of variance and the overlap ratio (defined later). Our experiment demonstrates that the WDCTC codebooks represent the underlying vowel data better than the MFCC codebooks.

In order to compare the classification capability of the features, the WDCTC and MFCC are employed in a 5-vowel recognition task.
Vector quantization (VQ) and 1-nearest neighbor (1-NN) [2] classifiers are used, and their recognition performance is reported. We also investigate the clean and noisy vowel clusters formed by the WDCTC and MFCC features and present the average separability of the vowel classes.
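To make the codebook and classifier terminology concrete, the following is a minimal sketch of per-class k-means codebooks, a VQ decision rule, and a 1-NN decision rule. It is not the authors' implementation. The helper names (`kmeans_codebook`, `vq_classify`, `nn_classify`), the squared Euclidean distance, the random initialization, the fixed iteration count, and the single-vector VQ decision (a real system might average the distortion over all frames of a vowel token) are all assumptions made for illustration. The toy 2-D vectors stand in for WDCTC or MFCC features and are not data from this study.

```python
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_codebook(data: List[Vector], k: int, iters: int = 20, seed: int = 0) -> List[List[float]]:
    """Hypothetical helper: Lloyd's k-means, returning k codewords for one vowel class."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, k)]  # random initial codewords
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for v in data:  # assignment step: attach each vector to its nearest codeword
            j = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[j].append(v)
        for j, members in enumerate(clusters):  # update step: move codewords to cluster means
            if members:  # an empty cluster keeps its previous codeword
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def vq_classify(x: Vector, codebooks: Dict[str, List[List[float]]]) -> str:
    """Hypothetical VQ rule: pick the class whose codebook holds the codeword closest to x."""
    return min(codebooks, key=lambda lab: min(sq_dist(x, c) for c in codebooks[lab]))


def nn_classify(x: Vector, train: Dict[str, List[Vector]]) -> str:
    """Hypothetical 1-NN rule: return the label of the single closest training vector."""
    return min(
        ((lab, sq_dist(x, v)) for lab, vecs in train.items() for v in vecs),
        key=lambda pair: pair[1],
    )[0]


if __name__ == "__main__":
    # Toy 2-D "features" for two vowel classes; illustrative stand-ins only.
    rng = random.Random(1)
    train = {
        "a": [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(50)],
        "i": [[rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)] for _ in range(50)],
    }
    codebooks = {lab: kmeans_codebook(vecs, k=4) for lab, vecs in train.items()}
    print(vq_classify([0.3, -0.2], codebooks), nn_classify([4.8, 5.1], train))  # prints: a i
```

The sketch also shows the practical difference between the two classifiers: VQ compares a test vector only against the k codewords of each class, whereas 1-NN compares it against every stored training vector, which is why codebooks are attractive when the training set is large.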

References

[1] Stephen A. Martucci et al., Symmetric convolution and the discrete sine and cosine transforms, 1993, IEEE Trans. Signal Process.

[2] Vishwa Gupta et al., Decision rules for speaker-independent isolated word recognition, 1984, ICASSP.

[3] Rangarao Muralishankar et al., Pseudo complex cepstrum using discrete cosine transform, 2005, Int. J. Speech Technol.

[4] Sanjit K. Mitra et al., Warped discrete cosine transform and its application in image compression, 2000, IEEE Trans. Circuits Syst. Video Technol.

[5] Detlev Langmann et al., A comparative study of linear feature transformation techniques for automatic speech recognition, 1996, Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96).

[6] Julius O. Smith et al., Bark and ERB bilinear transforms, 1999, IEEE Trans. Speech Audio Process.

[7] Douglas D. O'Shaughnessy et al., Warped discrete cosine transform cepstrum: A new feature for speech processing, 2005, 13th European Signal Processing Conference.