Finite-state Methods for Multimodal Parsing and Integration

Finite-state machines have been extensively applied to many aspects of language processing, including speech recognition (Pereira and Riley, 1997; Riccardi et al., 1996), phonology (Kaplan and Kay, 1994; Karttunen, 1991), morphology (Koskenniemi, 1984), chunking (Abney, 1991; Joshi and Hopely, 1997; Bangalore, 1997), parsing (Roche, 1999), and machine translation (Bangalore and Riccardi, 2000). In Johnston and Bangalore (2000) we showed how finite-state methods can be employed in a new and different task: parsing, integration, and understanding of multimodal input. Our approach addresses the particular case of multimodal input to a mobile device where the modes are speech and gestures made on the display with a pen, but it has far broader application. The approach uses a multimodal grammar specification which is compiled into a finite-state device running on three tapes. This device takes as input a speech stream and a gesture stream and outputs their combined meaning. The approach overcomes the computational complexity of unification-based approaches to multimodal processing (Johnston, 1998), enables tighter coupling with speech recognition, and enables straightforward composition with other kinds of language processing such as finite-state translation (Bangalore and Riccardi, 2000). In this paper, we present a revised and updated finite-state model for multimodal language processing which incorporates a number of significant advances over our earlier approach. We show how gesture symbols can be decomposed into attributes in order to reduce the alphabet of gesture symbols and enable underspecification of required gestures. We present a new mechanism for abstracting over gestural content that cannot be captured in the finite-state machine. We address the problems relating to deictic numerals (Johnston, 2000) by introducing a new mechanism for aggregation of adjacent gestures. We also show how spatial parsing of gestural inputs can

[1]  Emmanuel Roche, et al. Finite state transducers: parsing free and frozen sentences, 1999.

[2]  Roberto Pieraccini, et al. Stochastic automata for language modeling, 1996, Comput. Speech Lang.

[3]  Steven Abney, et al. Parsing By Chunks, 1991.

[4]  Aravind K. Joshi, et al. A parser from antiquity, 1996, Nat. Lang. Eng.

[5]  Mehryar Mohri, et al. A Rational Design for a Weighted Finite-State Transducer Library, 1997, Workshop on Implementing Automata.

[6]  Michael Johnston, et al. Finite-state Multimodal Parsing and Understanding, 2000, COLING.

[7]  Srinivas Bangalore, et al. Stochastic Finite-State Models for Spoken Language Machine Translation, 2000, Machine Translation.

[8]  Srinivas Bangalore, et al. Complexity of lexical descriptions and its relevance to partial parsing, 1997.

[9]  Mark-Jan Nederhof, et al. Regular Approximations of CFLs: A Grammatical View, 1997, IWPT.

[10]  Michael Johnston, et al. Unification-based Multimodal Parsing, 1998, ACL.

[11]  Michael Johnston, et al. Deixis and Conjunction in Multimodal Systems, 2000, COLING.

[12]  Gertjan van Noord. FSA Utilities: A Toolbox to Manipulate Finite-State Automata, 1996, Workshop on Implementing Automata.

[13]  Kimmo Koskenniemi, et al. A General Computational Model for Word-Form Recognition and Production, 1984, ACL.

[14]  Arnold L. Rosenberg, et al. On n-tape finite state acceptors, 1964, SWCT.

[15]  Michael Riley, et al. Speech Recognition by Composition of Weighted Finite Automata, 1996, ArXiv.

[16]  Yves Schabes, et al. Finite-State Approximation of Phrase-Structure Grammars, 1997.

[17]  Martin Kay, et al. Regular Models of Phonological Rule Systems, 1994, CL.

[18]  L. Karttunen. Finite-state Constraints, 1993.