Fractional Fourier transform based auditory feature for language identification

In this paper, a novel auditory feature based on the fractional Fourier transform (FRFT), namely the fractional auditory cepstrum coefficient (FACC), is presented for language identification (LID). Unlike the widely used Mel-frequency cepstrum coefficient (MFCC), the proposed feature draws on a model of the human auditory system and applies Gammatone filtering to the short-time fractional spectrum of the speech. Experimental results on the NIST 2003 Language Recognition Evaluation (LRE03) show that the FACC feature reduces the equal error rate (EER) by 10.5% relative to the MFCC feature.
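The reported improvement is a relative reduction, that is, the drop in EER expressed as a fraction of the MFCC baseline EER rather than a difference in percentage points. The short sketch below illustrates that arithmetic only. The function name `relative_eer_reduction` and both EER values are hypothetical, chosen so that the result matches the quoted 10.5%; the absolute EERs of the two systems are not given here.

```python
def relative_eer_reduction(baseline_eer: float, proposed_eer: float) -> float:
    """Return the EER reduction as a fraction of the baseline EER."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - proposed_eer) / baseline_eer


# Hypothetical EER values in percent, used only to illustrate the arithmetic.
# They are NOT results reported for MFCC or FACC.
mfcc_eer = 10.0
facc_eer = 8.95

reduction = relative_eer_reduction(mfcc_eer, facc_eer)
print(f"relative EER reduction: {reduction:.1%}")
```

Under these assumed values, the absolute difference is 1.05 percentage points, while the relative reduction is that difference divided by the baseline.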
