Fuzzy Phoneme Classification Using Multi-speaker Vocal Tract Length Normalization

ABSTRACT The overall success of automatic speech recognition (ASR) depends on accurate phoneme recognition and on the quality of the speech signal the recognizer receives. Differences between speakers, however, degrade recognition performance, and inter-speaker variability is one of the main causes. Vocal tract length normalization (VTLN) compensates for inter-speaker variation in the speech signal by applying a speaker-specific warping to the frequency scale of the filter bank. Instead of measuring word-level performance with speaker-specific warping, this research works directly at the phoneme level and applies VTLN to all speakers’ speech signals to determine the setting that gives the highest recognition performance. It compares per-phoneme recognition results for warp factors between 0.74 and 1.54, in increments of 0.02, over nine different frequency warping ranges. The warp factor and frequency warping range that give the highest phoneme recognition performance are then applied to word recognition. Compared with the baseline, a warp factor of 1.40 on the 300–5000 Hz frequency range improves phoneme recognition by 0.7% and spoken word recognition by 0.5%.
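The search space described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives only the warp factor range, the step, and the best-performing 300–5000 Hz range; the piecewise-linear shape of the warp, the direction of the scaling (here the warp factor multiplies the frequency), the helper names `warp_factors` and `warp_frequency`, and the filter bank edges `f_min = 0` Hz and `f_max = 8000` Hz (which would correspond to a 16 kHz sampling rate) are all assumptions made for illustration.

```python
def warp_factors(start=0.74, stop=1.54, step=0.02):
    """Return the grid of warp factors from start to stop, inclusive."""
    n_steps = round((stop - start) / step)
    # Build each value from an integer index so float error does not accumulate.
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


def _line(f, x0, y0, x1, y1):
    """Evaluate at f the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (f - x0) * (y1 - y0) / (x1 - x0)


def warp_frequency(f, alpha, f_low, f_high, f_min=0.0, f_max=8000.0):
    """Piecewise-linear frequency warp (assumed shape, not taken from the paper).

    Inside [f_low, f_high] the frequency is scaled by alpha. Outside that
    range the warp is interpolated linearly so that the filter bank edges
    f_min and f_max map to themselves.
    """
    if not f_min <= f <= f_max:
        raise ValueError("f must lie within [f_min, f_max]")
    if not f_min < f_low < f_high < f_max:
        raise ValueError("need f_min < f_low < f_high < f_max")
    w_low, w_high = alpha * f_low, alpha * f_high
    if not f_min < w_low < w_high < f_max:
        raise ValueError("alpha pushes the warped range past the band edges")
    if f < f_low:
        return _line(f, f_min, f_min, f_low, w_low)
    if f <= f_high:
        return alpha * f
    return _line(f, f_high, w_high, f_max, f_max)


if __name__ == "__main__":
    grid = warp_factors()
    print(len(grid), grid[0], grid[-1])
    # Warp 1 kHz with the best-performing setting reported in the abstract.
    print(warp_frequency(1000.0, 1.40, 300.0, 5000.0))
```

In a full experiment, each (warp factor, frequency range) pair from such a grid would be used to recompute the filter bank features and rescore phoneme recognition; that step depends on the recognizer and is not shown here.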
