Abstract — This paper introduces a Korean large-vocabulary speech recognizer for an embedded car navigation device. The proposed recognizer identifies 450k points of interest on a resource-limited device without serious performance degradation in severe car-noise environments. Before the speech recognition application was launched on the Korean retail market, a series of speech recognition tests was conducted in various moving vehicles, including sport utility vehicles and recreational vehicles. The on-line 10-best evaluation results for the 450k point-of-interest name recognition task show 84.2% accuracy under various driving conditions.

I. INTRODUCTION

For users unfamiliar with the interface of a small embedded device, pressing the alphabet buttons on its touchpad to retrieve information is troublesome. When this must be done while driving, it increases driver distraction, which puts the driver at risk of an accident. Progress in speech recognition technology over the last decades has made it possible to control in-car devices by voice. For example, a driver can select a song to play on an MP3 player or make a phone call using voice control while driving. In 2008, a Nuance study showed that voice recognition increased car safety by alleviating driver distraction. However, applications of speech recognition technology are still limited to relatively small vocabularies (<60k) because of the limited resources of current in-car devices. In this paper, we introduce a large-vocabulary automatic speech recognizer (ASR) that is embedded in a car navigation device and helps a driver find a route to a destination.

Despite recent progress in speech recognition technology, recognition performance can degrade in severe car-noise environments [1]. If the microphone is located some distance away from the user, as in a hands-free application, current automatic speech recognition technology still has difficulty meeting users' needs for recognition accuracy. Therefore, a speech recognizer must be able to cope with car noise before voice-enabled in-car devices can be commercialized. To do so, the speech preprocessor should remove ambient noise components without distorting speech and detect speech boundaries by identifying speech portions. In addition, acoustic models (AMs) based on the hidden Markov model (HMM) [2] should be effectively adapted to the various noisy conditions produced by a moving car. Furthermore, an efficient decoder design is required in order to apply large-vocabulary speech recognition technology to a resource-limited device.

We propose a large-vocabulary speech recognizer that is robust especially in car environments. The speech preprocessor in our system deals successfully with in-vehicle noise with the help of three component technologies: a single-channel speech enhancement method based on the widespread Wiener filter [6], an end-point detection (EPD) method with the proposed car-noise-robust feature, and the proposed simple speech/nonspeech discrimination method. To improve the speech quality of the conventional Wiener filter [6] for a voice recognizer, a power spectral density (PSD) estimator based on the human auditory model [3] and a voice activity detector (VAD) based on global speech absence probability (GSAP) [10] are proposed. To recognize hundreds of thousands of point-of-interest (POI) names in an embedded device without a serious increase in computational complexity or memory requirements, a two-stage decoder based on the human speech recognition (HSR) architecture is adopted [4,5].
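As background for the Wiener-filter enhancement mentioned above, the following is a minimal sketch of the textbook per-bin Wiener gain, G = xi / (1 + xi), where xi is an a priori SNR estimate. It is not the proposed preprocessor: the auditory-model PSD estimator and the GSAP-based VAD are not modeled, the noise PSD is simply assumed to be given, and the function name `wiener_gains` and the parameter `gain_floor` are hypothetical names introduced for illustration.

```python
def wiener_gains(noisy_psd, noise_psd, gain_floor=0.1):
    """Return one Wiener gain per frequency bin.

    noisy_psd  : power spectrum of the noisy input frame (assumed given)
    noise_psd  : estimated noise power spectrum (assumed given)
    gain_floor : hypothetical lower bound that limits over-suppression
    """
    gains = []
    for p_noisy, p_noise in zip(noisy_psd, noise_psd):
        if p_noise <= 0.0:
            # No noise estimated in this bin: pass the bin through unchanged.
            gains.append(1.0)
            continue
        # Crude a priori SNR estimate: a posteriori SNR minus one, clipped at zero.
        xi = max(p_noisy / p_noise - 1.0, 0.0)
        # Textbook Wiener gain, bounded below by the floor.
        gains.append(max(xi / (1.0 + xi), gain_floor))
    return gains
```

Multiplying each bin of the noisy spectrum by its gain attenuates bins dominated by noise while leaving high-SNR bins nearly untouched. The floor is one common way to trade residual noise against speech distortion; whether the system described here uses such a floor is not stated.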
To notify the user of driving conditions that are hostile to voice recognition, we propose an environment change detector (ECD) that estimates the on-line signal-to-noise ratio (SNR) and plays a warning message when the input signal has a low SNR.

This paper is organized as follows. Section II describes the proposed speech preprocessor and the ECD, and Section III briefly introduces the fast and memory-efficient speech recognition decoding scheme. Section IV describes the developed voice-enabled navigator. Section V presents the speech recognition accuracy evaluation results for real data under various driving conditions, and Section VI gives the conclusions.

II. T
[1] Christian Wellekens, et al., "DISTBIC: A speaker-based segmentation for audio data indexing," Speech Commun., 2000.
[2] 渡辺馨, "Objective measurement method of audio quality in accordance with ITU-R Recommendation BS. 1387," 2001.
[3] Odette Scharenborg, et al., "Parallels between HSR and ASR: how ASR can contribute to HSR," INTERSPEECH, 2005.
[4] Denis Jouvet, et al., "Evaluation of a noise-robust DSR front-end on Aurora databases," INTERSPEECH, 2002.
[5] Ho-Young Jung, et al., "Discriminative noise adaptive training approach for an environment migration," INTERSPEECH, 2007.
[6] Yifan Gong, et al., "Speech recognition in noisy environments: A survey," Speech Commun., 1995.
[7] "RECOMMENDATION ITU-R BS.1387-1 - Method for objective measurements of perceived audio quality," 2002.
[8] Lawrence R. Rabiner, et al., "On the use of autocorrelation analysis for pitch detection," 1977.
[9] Hoon Chung, et al., "Memory efficient and fast speech recognition system for low-resource mobile devices," IEEE Transactions on Consumer Electronics, 2006.
[10] Alex Bateman, et al., "An introduction to hidden Markov models," Current protocols in bioinformatics, 2007.
[11] Nam Soo Kim, et al., "Spectral enhancement based on global soft decision," IEEE Signal Processing Letters, 2000.