F. Casacuberta, A. Castaño, A. Marzal, F. Prat, J. M. Vilar
Depto. de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, 46071 Valencia (Spain)

The EUTRANS project aims at developing Machine Translation systems for limited domain applications. These systems accept speech and text input, and are trained using an example-based approach. The translation model used in this project is the Subsequential Transducer, which is easily integrable in conventional speech recognition systems. In addition, Subsequential Transducers can be automatically learned from corpora. This paper describes the use of categories for improving the EUTRANS translation systems. Experimental results with the task defined in the project show that this approach reduces the number of examples required for achieving good models.

1 Introduction

The EUTRANS project (1) (Amengual et al., 1996a), funded by the European Union, aims at developing Machine Translation systems for limited domain applications. These systems accept speech and text input, and are trained using an example-based approach. The translation model used in this project is the Subsequential Transducer (SST), which is easily integrable in conventional speech recognition systems by using it both as language and translation model (Jiménez et al., 1995). In addition, SSTs can be automatically learned from sentence-aligned bilingual corpora (Oncina et al., 1993). This paper describes the use of categories, both in the training and translation processes, for improving the EUTRANS translation systems. The approach presented here improves on that of (Vilar et al., 1995): the integration of categories within the systems is simpler, and it allows for categories grouping units larger than a word.

(1) Example-Based Understanding and Translation Systems (EUTRANS). Information Technology, Long Term Research Domain, Open Scheme, Project Number 20268.
Experimental results with the Traveler Task, defined in (Amengual et al., 1996b), show that this method reduces the number of examples required for achieving good models.

The rest of the paper is structured as follows. In section 2, some basic concepts and the notation are introduced. The technique used for integrating categories in the system is detailed in section 3. Section 4 presents the speech translation system. Both speech and text input experiments are described in section 5. Finally, section 6 presents some conclusions and new directions.

2 Basic Concepts and Notation

Given an alphabet X, X* is the free monoid of strings over X. The symbol λ represents the empty string, first letters (a, b, c, ...) represent individual symbols of the alphabet, and last letters (z, y, x, ...) represent strings of the free monoid. We refer to the individual elements of the strings by means of subindices, as in x = a1...an. Given two strings x, y ∈ X*, xy denotes the concatenation of x and y.

2.1 Subsequential Transducers

A Subsequential Transducer (Berstel, 1979) is a deterministic finite-state network that accepts sentences from a given input language and produces associated sentences of an output language. An SST is composed of states and arcs. Each arc connects two states and is associated with an input symbol and an output substring (which may be empty). The translation of an input sentence is obtained by starting from the initial state, following the path corresponding to its symbols through the network, and concatenating the corresponding output substrings.
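The translation procedure just described can be sketched in a few lines of code. The following is a minimal illustration, not the EUTRANS implementation: the transducer, its states, and the toy Spanish-to-English vocabulary are invented for the example, and per-state final output strings (appended when the input ends at a state) are included since subsequential transducers allow them.

```python
class SST:
    """Toy Subsequential Transducer: deterministic states and arcs,
    each arc labeled with an input symbol and an output substring."""

    def __init__(self, initial, arcs, final_output):
        self.initial = initial
        # arcs: (state, input_symbol) -> (next_state, output_substring)
        self.arcs = arcs
        # final_output: state -> string emitted when the input ends there;
        # states absent from this map are not accepting
        self.final_output = final_output

    def translate(self, sentence):
        # Follow the unique path for the input symbols, concatenating
        # the output substrings found along the way.
        state, out = self.initial, []
        for symbol in sentence:
            if (state, symbol) not in self.arcs:
                return None  # input sentence not in the transducer's domain
            state, piece = self.arcs[(state, symbol)]
            if piece:
                out.append(piece)
        if state not in self.final_output:
            return None  # path ended in a non-accepting state
        if self.final_output[state]:
            out.append(self.final_output[state])
        return " ".join(out)


# Hypothetical fragment of a travel-domain transducer. Note how the
# output for "una" is delayed until enough input context is seen.
toy = SST(
    initial=0,
    arcs={
        (0, "una"): (1, ""),
        (1, "habitacion"): (2, "a room"),
        (2, "doble"): (3, ""),
        (3, "."): (4, "with two beds ."),
    },
    final_output={4: ""},
)

print(toy.translate("una habitacion doble .".split()))
# -> a room with two beds .
```

Delaying output on an arc (emitting the empty substring and producing the translation later) is what lets a deterministic device handle local word-order differences between the two languages.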
References

[1] Jeffrey D. Ullman et al. Introduction to Automata Theory, Languages and Computation. 1979.
[2] Michael G. Thomason et al. Syntactic Pattern Recognition, An Introduction. 1978. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[3] John Cocke et al. A Statistical Approach to Machine Translation. 1990. CL.
[4] Enrique Vidal et al. Learning Subsequential Transducers for Pattern Recognition Interpretation Tasks. 1993. IEEE Trans. Pattern Anal. Mach. Intell.
[5] Enrique Vidal et al. Application of OSTIA to Machine Translation Tasks. 1994. ICGI.
[6] Enrique Vidal et al. Some results with a trainable speech translation and understanding system. 1995. 1995 International Conference on Acoustics, Speech, and Signal Processing.
[7] Enrique Vidal et al. Learning language translation in limited domains using finite-state models: some extensions and improvements. 1995. EUROSPEECH.
[8] Francisco Casacuberta et al. Grammatical Inference and Automatic Speech Recognition. 1995.
[9] José Oncina et al. Using domain information during the learning of a subsequential transducer. 1996. ICGI.
[10] Hermann Ney et al. Speech translation based on automatically trainable finite-state models. 1997. EUROSPEECH.
[11] Francisco Casacuberta et al. Error correcting parsing for text-to-text machine translation using finite state models. 1997. TMI.