A Speech Interface for a Mobile Robot Controlled by GOLOG

Department of Computer Science V, Aachen University of Technology, 52056 Aachen, Germany, {dylla,gerhard}@cs.rwth-aachen.de

Abstract. With today's high-level plan languages like GOLOG or rpl it is possible for mobile robots to cope with complex problems. Unfortunately, instructing the robot what to do or interacting with it is still awkward. Usually, instructions are given by loading the appropriate program, and interaction amounts to little more than pressing buttons positioned on the robot itself. The goal of this project is to offer a robust and easily expandable speech interface for GOLOG, implemented on a mobile robot. Using a headset, a user can instruct the robot to perform tasks from prespecified domains like mail or coffee delivery. Limited forms of interaction are also supported.

1 Motivation and Goals

With today's high-level plan languages (like GOLOG [12, 15] or rpl [18]) it is possible for mobile robots to cope with complex problems. Unfortunately, instructing the robot what to do or interacting with it is still awkward. Usually, instructions are given by loading the appropriate program, and interaction amounts to little more than pressing buttons positioned on the robot itself.

The goal of this project [6] is to offer a robust and easily expandable speech interface for GOLOG, implemented on a mobile robot. (Actually, the speech interface can be used for rpl as well, which is implemented on the same robot.) Using a headset, a user can instruct the robot to perform tasks from prespecified domains. Limited forms of interaction are also supported. Among other things, the robot is able to deal with tasks like serving coffee, delivering letters, or guiding people through a museum like its older brother RHINO [4].

To get an idea of the awkwardness in interacting with RHINO, let us look at the museum guide application in a little more detail. During a tour, visitors had to convey their requests to RHINO by pressing buttons. This is very uncomfortable, and RHINO has to tell everybody what its four buttons stand for:

RHINO: "Please press the red button for tour one, please press the yellow button for tour two, ..."

The buttons are activated afterwards, and the visitor is able to make her decision. Similarly, in the case of interaction, RHINO has to tell the visitor which button stands for which answer. For example:

RHINO: "If you want more information about this exhibit, press the red button; if not, press the blue button."

Clearly, an interaction with spoken language would be much more natural and easier to deal with from the user's point of view. This has motivated us to start work on a speech interface for our mobile robot CARL. Ideally, CARL would be able to handle the above situation like this:

CARL: "I can guide you through six different tours. Which one would you like?"

Visitor: "I'd like you to show me tour four."

Then CARL begins tour four.

In this paper, we will give an overview of our approach. In order to deal with the complexities of both speech recognition and natural language understanding, we employ more or less drastic restrictions and assumptions where necessary. For example, to reduce noise, we require the user to use a headset. When trying to extract meaning from a word chain, we heavily rely on the assumption that the application domain is prespecified and fairly small. In particular, we focus on coffee and mail delivery tasks and assume that the user has some knowledge of the limitations of the system and uses fairly simple instructions. (Hence, at this stage we probably cannot handle the above museum scenario very well, since visitors cannot be expected to know about the system's restrictions.)
Nevertheless, our system is devised to be extensible so that, in the future, more complicated dialogues can be handled and other applications can be added.

The speech control, roughly, consists of two parts: the speech recognition software and a semantic parser. The speech recognizer extracts a word chain from the analog speech signal. The parser then builds a semantic representation of the word chain using a keyword spotting method. In the simplest case, detecting the keyword 'coffee' may be all that is needed to determine that the user wants coffee. The semantic representation is then passed to the GOLOG system, which initiates appropriate actions, possibly including further interaction with the user. The basic structure of the system is shown in Figure 1. Problems arise if the word chain is faulty, and strategies have to be devised to prevent wrong robot behavior due to wrongly detected words.

Figure 1. The basic structure of the speech control

The rest of the paper is organized as follows. In Section 2 the basics of statistical speech recognition and the speech recognizers used are described. Section 3 deals with the semantic interpretation of a word chain, and Section 4 briefly introduces the situation calculus and GOLOG. In Section 5 the mobile robot CARL is presented with its hard- and software. We end with a few remarks on the state of the implementation and some conclusions.

2 Statistical Speech Recognition

There are many different methods to build a speech recognizer, for example, using neural networks [2] or distance functions [16]. Especially the statistical interpretation of distance functions has been found very useful [8, 13], in particular in connection with Hidden Markov Models (HMMs) [24]. (We remark that there are also non-statistical methods like Nearest Neighbor or Neural Networks [23, 26].) An overview is given in [1, 19, 20, 14]. The basic principles of a statistical speech recognizer are shown in Figure 2.

Figure 2. The basic structure of a statistical speech recognizer

The idea is, roughly, to first find an abstract representation of the acoustic signal and then look for a word chain which best matches that representation. The abstract representation is extracted by first dividing the signal into T overlapping frames of 20 to 40 milliseconds duration and then, based on the Fast Fourier Transformation or the discrete cosine transformation [11, 25], assigning each frame i a so-called feature vector x_i. Given a sequence of feature vectors x_1^T = x_1 ... x_T, finding the most probable word chain R(x_1^T) can be phrased in terms of conditional probabilities:

\[ R(x_1^T) = \operatorname*{argmax}_{w_1 \ldots w_N} P(w_1^N \mid x_1^T). \]

This should be read as "select that word chain whose probability, given x_1^T, is maximal." The problem, of course, is to find an appropriate probability distribution P. For that we first apply Bayes' Rule (ignoring, as usual, the denominator) and obtain

\[ R(x_1^T) = \operatorname*{argmax}_{w_1 \ldots w_N} P(x_1^T \mid w_1^N)\, P(w_1^N). \]
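Spelled out, this Bayes step is the standard derivation (notation as above):

\[
R(x_1^T) = \operatorname*{argmax}_{w_1 \ldots w_N} P(w_1^N \mid x_1^T)
         = \operatorname*{argmax}_{w_1 \ldots w_N} \frac{P(x_1^T \mid w_1^N)\, P(w_1^N)}{P(x_1^T)}
         = \operatorname*{argmax}_{w_1 \ldots w_N} P(x_1^T \mid w_1^N)\, P(w_1^N),
\]

where the denominator P(x_1^T) is the same for every candidate word chain and therefore does not affect the maximization.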
Hence we need to determine two probability distributions:

- the conditional probability distribution P(x_1^T | w_1^N); this is called the acoustic model and describes how the words are linked to the feature vectors;
- the a priori distribution of the word chain, P(w_1^N); this is the linguistic (or language) model and describes the occurrence probability of the particular word series w_1^N.

Both distributions are usually obtained by analyzing large training sets involving Bayesian updating. We will not go into the details here except to mention that Hidden Markov Models are employed for the acoustic model, and the linguistic model is often simplified by using so-called uni-, bi-, or trigram models [21]. For example, the trigram model is defined as

\[ P(w_1^N) = \prod_{n=1}^{N} P(w_n \mid w_{n-2}, w_{n-1}) \]

(a small computational sketch of this model is given at the end of this section).

Typical problems when dealing with speech recognition are:

- the velocity of speech, which is heavily speaker dependent;
- large variation in the pronunciations of the same phoneme;
- the effect of coarticulation, that is, neighboring sounds influence each other: for example, an 'i' is pronounced differently in front of a 't' than in front of an 's';
- background noise.

Finally, even after thorough training and testing of a speech recognizer, the overall quality of the system can vary significantly in practice.
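To make the trigram model concrete, the following is a minimal sketch of how such a model could be estimated from a training corpus by simple relative frequencies (real recognizers additionally smooth these estimates; the toy corpus and function names below are purely illustrative):

from collections import defaultdict

def train_trigram(corpus):
    """Estimate P(w_n | w_{n-2}, w_{n-1}) by relative frequencies.

    corpus: list of sentences, each a list of words.
    Returns a dict mapping (w_{n-2}, w_{n-1}) -> {w_n: probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        # Pad with start/end symbols so every word has a two-word history.
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            counts[(words[i - 2], words[i - 1])][words[i]] += 1
    model = {}
    for history, successors in counts.items():
        total = sum(successors.values())
        model[history] = {w: c / total for w, c in successors.items()}
    return model

def chain_probability(model, sentence):
    """P(w_1^N) as the product of trigram probabilities; 0 if unseen."""
    words = ["<s>", "<s>"] + sentence + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        p *= model.get((words[i - 2], words[i - 1]), {}).get(words[i], 0.0)
    return p

# Example: a toy corpus from the coffee/mail delivery domain.
corpus = [["bring", "me", "a", "coffee"], ["bring", "me", "the", "mail"]]
model = train_trigram(corpus)
print(chain_probability(model, ["bring", "me", "a", "coffee"]))  # 0.5

Running the example assigns probability 0.5 to "bring me a coffee", since after the history "bring me" the words "a" and "the" are equally likely in the toy corpus.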

References

[1] Heinrich Niemann et al. Generating word hypotheses in continuous speech. ICASSP '86, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986.
[2] Takao Kobayashi et al. Speech coding based on adaptive mel-cepstral analysis. ICASSP '94, IEEE International Conference on Acoustics, Speech and Signal Processing, 1994.
[3] Mark J. F. Gales et al. Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs. ICSLP '96, Fourth International Conference on Spoken Language Processing, 1996.
[4] Hermann Ney et al. Data driven search organization for continuous speech recognition. IEEE Transactions on Signal Processing, 1992.
[5] John S. Bridle et al. Neural Networks or Hidden Markov Models for Automatic Speech Recognition: Is there a Choice?, 1992.
[6] Hector J. Levesque et al. GOLOG: A Logic Programming Language for Dynamic Domains. Journal of Logic Programming, 1997.
[7] Hector J. Levesque et al. Foundations for the Situation Calculus. Electronic Transactions on Artificial Intelligence, 1998.
[8] Wolfram Burgard et al. The Interactive Museum Tour-Guide Robot. AAAI/IAAI, 1998.
[9] Hector J. Levesque et al. Reasoning about Concurrent Execution, Prioritized Interrupts, and Exogenous Actions in the Situation Calculus. IJCAI, 1997.
[10] F. Jelinek et al. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 1976.
[11] Biing-Hwang Juang et al. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, 1993.
[12] Lawrence R. Rabiner et al. A tutorial on Hidden Markov Models, 1986.
[13] Lalit R. Bahl et al. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. ICASSP '86, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986.
[14] Lawrence R. Rabiner et al. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989.
[15] J. McCarthy. Situations, Actions, and Causal Laws, 1963.
[16] Drew McDermott et al. A reactive plan language, 1991.
[17] Wolfram Burgard et al. GOLEX: Bridging the Gap between Logic (GOLOG) and a Real Robot. KI, 1998.
[18] Wolfram Burgard et al. The Mobile Robot RHINO. SNN Symposium on Neural Networks, 1995.