Real-time Hardware Feature Extraction with Embedded Signal Enhancement for Automatic Speech Recognition

Using speech to communicate with computers and other machines has been a human vision for decades. Speech input promises substantial advantages over standard input peripherals such as the mouse, keyboard, and buttons. To make this vision a reality, considerable effort and investment have gone into automatic speech recognition (ASR) research for over six decades. While current speech recognition systems perform very well in benign environments, their performance is rather limited in many real-world settings. One of the main degrading factors is background noise captured along with the wanted speech. There is a wide range of possible uncorrelated noise sources, and they are generally short-lived and non-stationary. In automotive environments, for example, road noise, engine noise, and passing vehicles compete with the speech. Noise can also be continuous, such as wind noise, particularly from an open window, or noise from a ventilation or air conditioning unit. A number of methods are being investigated to make speech recognition systems more robust. These include robust feature extraction and recognition algorithms as well as speech enhancement. Enhancement techniques aim to remove (or at least reduce) the noise present in the speech signals, allowing clean speech models to be utilised in the recognition stage. This approach is popular because little or no prior knowledge of the operating environment is required to improve recognition accuracy. While many ASR and enhancement algorithms and models have been proposed, the question of how to implement them efficiently remains. Many software implementations of these algorithms exist, but they are limited in application because they require relatively powerful general-purpose processors.
To achieve a real-time design with both low cost and high performance, a dedicated hardware implementation is necessary. This chapter presents the design of a Real-time Hardware Feature Extraction System with Embedded Signal Enhancement for Automatic Speech Recognition, suitable for implementation in low-cost Field Programmable Gate Array (FPGA) hardware. While the design suits many other applications, it was motivated by automotive applications, which require real-time, low-cost hardware without sacrificing performance. The main components of this design are an efficient implementation of the Discrete Fourier Transform (DFT), speech enhancement, and Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction.
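The three components named above can be illustrated with a minimal software sketch of one frame passing through a DFT, an enhancement step, and MFCC extraction. This is not the chapter's hardware design. It assumes a Hamming window, a naive O(N^2) DFT in place of an efficient one, power spectral subtraction as a stand-in for the enhancement stage, and triangular mel filters followed by a DCT-II. All function names and parameter values (frame length, filter count, spectral floor) are hypothetical choices made for illustration.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (assumed window choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_power(frame):
    """Power spectrum for bins 0..N/2 using a naive O(N^2) DFT."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power


def spectral_subtract(power, noise_power, floor=0.01):
    """Subtract a noise estimate per bin, keeping a small spectral floor."""
    return [max(p - n, floor * p) for p, n in zip(power, noise_power)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale up to Nyquist."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_out):
    """Unnormalised DCT-II, returning the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(n_out)]


def mfcc_frame(frame, noise_power, bank, n_ceps=13):
    """Window -> DFT -> enhancement -> mel energies -> log -> DCT."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    clean = spectral_subtract(dft_power(windowed), noise_power)
    energies = [sum(w * p for w, p in zip(row, clean)) for row in bank]
    logs = [math.log(max(e, 1e-10)) for e in energies]  # guard against log(0)
    return dct2(logs, n_ceps)


# Example: one 256-sample frame of a 440 Hz tone sampled at 8 kHz,
# with an all-zero noise estimate (so the enhancement step is a no-op).
SAMPLE_RATE, N_FFT = 8000, 256
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(N_FFT)]
bank = mel_filterbank(20, N_FFT, SAMPLE_RATE)
ceps = mfcc_frame(tone, [0.0] * (N_FFT // 2 + 1), bank)
```

In a real system the noise estimate would typically be updated from frames judged to contain no speech, and a hardware design would replace the naive DFT and floating-point arithmetic with an FFT-style structure and fixed-point operations. Those details are outside what this sketch attempts to show.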
