Multimodal language processing for mobile information access

ABSTRACT

Interfaces for mobile information access need to allow users flexibility in their choice of modes and interaction style in accordance with their preferences, the task at hand, and their physical and social environment. This paper describes the approach to multimodal language processing in MATCH (Multimodal Access To City Help), a mobile multimodal speech-pen interface to restaurant and subway information for New York City. Finite-state methods for multimodal integration and understanding enable users to interact using pen, speech, or dynamic combinations of the two, and a speech-act based multimodal dialogue manager enables mixed-initiative multimodal dialogue.

(Thanks to AT&T Labs and DARPA ITO (contract No. MDA972-99-30003) for financial support.)

1. LANGUAGE PROCESSING FOR MOBILE SYSTEMS

Mobile information access devices (PDAs, tablet PCs, next-generation phones) offer limited screen real estate and no keyboard or mouse, making complex graphical interfaces cumbersome. Multimodal interfaces can address this problem by enabling speech and pen input and output that combines synthetic speech and graphics (see [1] for a detailed overview of previous work on multimodal input and output). Furthermore, since mobile devices are used in situations involving different physical and social environments, tasks, and users, they need to allow users to provide input in whichever mode or combination of modes is most appropriate given the situation and the user's preferences.

Our testbed multimodal application MATCH (Multimodal Access To City Help) allows all commands to be expressed either by speech, by pen, or multimodally. This is achieved by capturing the parsing, integration, and understanding of speech and gesture inputs in a single multimodal grammar which is compiled into a multimodal finite-state device. This device is tightly integrated with a speech-act based multimodal dialogue manager, enabling users to complete commands either in a single turn or over the course of a number of dialogue turns. In Section 2 we describe the MATCH application. In Section 3, we describe the multimodal language processing architecture underlying MATCH.

2. THE MATCH APPLICATION

Urban environments present a complex and constantly changing body of information regarding restaurants, cinema and theatre schedules, transportation topology, and timetables. This information is most valuable if it can be delivered effectively while mobile, since users' needs change while they are out and the information itself is dynamic (e.g. train times change and shows get cancelled).

MATCH is a working city guide and navigation system that enables mobile users to access restaurant and subway information for New York City (NYC). MATCH runs standalone on a Fujitsu pen computer, and can also run in client-server mode across a wireless network. The user interacts with a graphical interface displaying restaurant listings and a dynamic map showing locations and street information (the Multimodal UI). They are free to give commands or reply to requests using speech, by drawing on the display with a stylus, or using synchronous multimodal combinations of the two modes. For example, they can request to see restaurants using the spoken command "show cheap italian restaurants in chelsea". The system will then zoom to the appropriate map location and show the locations of restaurants on the map. Alternatively, they could give the same command multimodally by circling an area on the map and saying "show cheap italian restaurants in this neighborhood".
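To make the finite-state integration idea mentioned in Section 1 concrete, the following toy Python sketch aligns a speech stream and a gesture stream against a hand-written pattern of speech:gesture:meaning triples for the "show ... in this neighborhood" command. The pattern, the symbols, and the integrate() helper are illustrative assumptions only; the actual MATCH grammar is compiled into finite-state transducers rather than matched by a hand-written loop.

```python
# Hypothetical sketch of multimodal integration: each step of the pattern
# aligns a speech symbol, a gesture symbol, and a meaning symbol ("eps"
# marks an empty slot on a tape).  This is NOT the MATCH implementation,
# only an illustration of how speech and gesture jointly yield a meaning.

EPS = "eps"

# speech : gesture : meaning triples for one command pattern (assumed)
SHOW_IN_AREA = [
    ("show",          EPS,    "<cmd show>"),
    ("cheap",         EPS,    "<price cheap>"),
    ("italian",       EPS,    "<cuisine italian>"),
    ("restaurants",   EPS,    EPS),
    ("in",            EPS,    EPS),
    ("this",          EPS,    EPS),
    ("neighborhood",  "area", "<loc AREA-SEM>"),
]

def integrate(speech_words, gesture_symbols, pattern=SHOW_IN_AREA):
    """Consume the speech and gesture streams in step with the pattern and
    emit the meaning symbols; return None if either stream fails to match."""
    s, g, meaning = list(speech_words), list(gesture_symbols), []
    for sp, ge, me in pattern:
        if sp != EPS:
            if not s or s.pop(0) != sp:
                return None
        if ge != EPS:
            if not g or g.pop(0) != ge:
                return None
        if me != EPS:
            meaning.append(me)
    return meaning if not s and not g else None

print(integrate("show cheap italian restaurants in this neighborhood".split(),
                ["area"]))
# -> ['<cmd show>', '<price cheap>', '<cuisine italian>', '<loc AREA-SEM>']
```

In the multimodal case the circled area fills the location slot (AREA-SEM); in the unimodal spoken case a named zone such as "chelsea" would fill it instead.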
If the immediate environment is too noisy or public, the same command can be given completely in pen, as in Figure 1, by circling an area and writing "cheap" and "italian".

Fig. 1. Unimodal pen command

The user can ask for the review, cuisine, phone number, address, or other information for a restaurant or set of restaurants. The system responds with graphical callouts on the display, synchronized with synthetic speech output. For example, if the user says "phone numbers for these three restaurants" and circles a total of three restaurants as in Figure 2, the system will draw a callout with the restaurant name and number and say, for example, "Le Zie can be reached at 212-206-8686", for each restaurant in turn (Figure 3). These information-seeking commands can also be issued solely with pen. For example, the user could alternatively have circled the restaurants and written "phone".

Fig. 2. Two area gestures

Fig. 3. Phone query callouts

The system also provides subway directions. For example, if the user says "How do I get to this place?" and circles one of the restaurants displayed on the map, the system will ask "Where do you want to go from?" The user can then respond with speech, e.g. "25th Street and 3rd Avenue"; with pen, by writing e.g. "25th St & 3rd Ave"; or multimodally, e.g. "from here" with a circle gesture indicating the location. The system then calculates the optimal subway route and dynamically generates a multimodal presentation indicating the series of actions the user needs to take (Figure 4).

Fig. 4. Multimodal subway route

3. MULTIMODAL LANGUAGE PROCESSING

The multimodal architecture which supports MATCH consists of a series of agents which communicate through a Java-based facilitator, MCUBE (Figure 5). We focus in this paper on multimodal input processing: the handling and representation of speech and electronic ink, their integration and interpretation, and the multimodal dialogue manager. [2] presents an experiment on text planning within the MATCH architecture. [3] describes the approach to mobile multimodal logging for MATCH.

Fig. 5. MATCH multimodal architecture

3.1. Speech Input Handling

In order to provide spoken input, the user must hit a click-to-speak button on the Multimodal UI. We found that in an application such as MATCH, which provides extensive unimodal pen-based interaction, it was preferable to use click-to-speak rather than pen-to-speak or open-mike. With pen-to-speak, spurious speech results received in noisy environments can disrupt unimodal pen commands. The click-to-speak button activates a speech manager running on the device which gathers audio and communicates with a recognition server (AT&T's Watson speech recognition engine). The output from the recognition server is a lattice of possible word string hypotheses with associated costs. This lattice is passed to the multimodal integrator (MMFST).

3.2. Recognizing and Representing Electronic Ink

Just as we determine a lattice of possible word strings for the audio signal in the speech mode, for the gesture mode we need to generate a lattice of possible classifications and interpretations of the electronic ink. A given sequence of ink strokes may contain symbolic gestures such as lines and arrows, handwritten words, and selections of entities on the display. When the user draws on the map, their ink is captured and any objects potentially selected, such as currently displayed restaurants or subway stations, are determined.
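Both modes therefore deliver their hypotheses to MMFST in the same general form: a weighted lattice. The Python sketch below is a simplified, hypothetical illustration of such a lattice, encoded as weighted arcs (from-state, to-state, label, cost), together with a helper that extracts the cheapest path. Neither the arc format nor the best_path() function is the actual MATCH or Watson interface; they only show how competing hypotheses (e.g. "cheap" vs. "chelsea") can coexist in one structure until integration disambiguates them.

```python
# Illustrative lattice encoding (assumed, not the MATCH data format):
# each arc is (from_state, to_state, label, cost); lower cost is better.

import heapq
from collections import defaultdict

def best_path(arcs, start, final):
    """Return (cost, labels) of the cheapest path from start to final."""
    out = defaultdict(list)
    for src, dst, label, cost in arcs:
        out[src].append((dst, label, cost))
    queue = [(0.0, start, [])]
    seen = set()
    while queue:
        cost, state, labels = heapq.heappop(queue)
        if state == final:
            return cost, labels
        if state in seen:
            continue
        seen.add(state)
        for dst, label, arc_cost in out[state]:
            heapq.heappush(queue, (cost + arc_cost, dst, labels + [label]))
    return None

# Toy speech lattice: the recognizer is unsure about one word.
speech_lattice = [
    (0, 1, "show", 1.0),
    (1, 2, "cheap", 3.0), (1, 2, "chelsea", 5.0),
    (2, 3, "italian", 2.0),
    (3, 4, "restaurants", 1.0),
]
print(best_path(speech_lattice, 0, 4))
# -> (7.0, ['show', 'cheap', 'italian', 'restaurants'])
```

The gesture lattice described next carries the same kind of structure, with arcs labelled by gesture symbols rather than words.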
The electronic ink is broken into a lattice of strokes and passed to the gesture recognition and handwriting recognition components to determine possible classifications of gestures and handwriting in the ink stream. Recognitions are performed both on individual strokes and on combinations of strokes in the ink stroke lattice. For MATCH, the handwriting recognizer supports a vocabulary of 285 words, including attributes of restaurants (e.g. 'chinese', 'cheap') and zones and points of interest (e.g. 'soho', 'empire', 'state', 'building'). The gesture recognizer recognizes a set of 10 basic gestures, including lines, arrows, areas, points, and question marks. It uses a variant of Rubine's classic template-based gesture recognition algorithm [4], trained on a corpus of sample gestures. In addition to classifying gestures, the gesture recognition component also extracts features such as the base and head of arrows.

The gesture and handwriting recognition components enrich the ink stroke lattice with possible classifications of strokes and stroke combinations, and pass this enriched stroke lattice back to the Multimodal UI. The Multimodal UI then takes this classified stroke lattice and the selection information and builds a lattice representation of all the possible interpretations of the user's ink, which it passes to MMFST.

3.2.1. Representation of Complex Pen-based Input

The representation of pen input in MATCH is significantly more involved than in our earlier approach to finite-state multimodal language processing [5, 6], in which the gestures were sequences of simple deictic references to people (Gp) or organizations (Go). The interpretations of electronic ink are encoded as symbol complexes of the following form:

    G FORM MEANING (NUMBER TYPE) SEM

FORM indicates the physical form of the gesture and has values such as area, point, line, arrow. MEANING indicates the meaning of that form; for example, an area can be either a loc(ation) or a sel(ection). NUMBER and TYPE indicate the number of entities in a selection (1, 2, 3, many) and their type (rest(aurant), theatre). SEM is a placeholder for the specific content of the gesture, such as the points that make up an area or the identifiers of objects in a selection (e.g. id1, id2).

For example, if, as in Figure 2, the user makes two area gestures, one around a single restaurant and the other around two restaurants, the resulting gesture lattice will be as in Figure 6. The first gesture (nodes 0-7) is either a reference to a location (loc.) (0-3,7) or a reference to a restaurant (sel.) (0-2,4-7). The second (nodes 7-13,16) is either a reference to a location (7-10,16) or to a set of two restaurants (7-9,11-13,16). If the user says "show chinese restaurants in this neighborhood and this neighborhood", the path containing the two locations (0-3,7-10,16) will be taken when this lattice is combined with speech in MMFST. If the user says "tell me about this place and these places", then the path with the