Vectorized finite state automata
We present a technique of finite state parsing based on vectorization and describe the application of this technique to a well-known problem of natural language processing, that of extracting relational information from English text. We define Vectorized Finite State Automata, the theoretical model behind the applied system, and discuss their significance.

0 Introduction

One of the persistent problems in building finite automata on the large scale required by actual applications is that the product and powerset constructions routinely used to implement intersection and nondeterminism can, in a few steps, increase the size of the state space beyond reasonable bounds. This paper will describe how to avoid this problem by structuring the state space as a generalized finite vector space.

Section 1 of the paper introduces the problem by means of a highly artificial but simple example, informally presents the basic idea of Vectorized Finite State Automata (VFSA), and outlines the VFSA solution for this particular problem. Section 2 presents an overview of NewsMonitor, a system extracting relational information from English text, with particular emphasis on the VFSA pattern matching engine around which NewsMonitor is built. Section 3 provides the formal definition of VFSA, discusses their properties, and compares them to Register Vector Grammars (RVGs) [3]. The theoretical implications of the work are discussed in Section 4.

Broadly speaking, there are three ways vectorization can enter the standard setup for finite state language modeling. First, the alphabet itself can be composed of n-tuples, a conceptualization particularly useful for n-ary regular relations and n-way finite automata [9]. Second, the alphabet symbols in a single dimension can be thought of as vectors composed of binary features, as is commonly done in Prague-style and in generative phonology [4, 5]. Third, the state space itself can be conceptualized as a vector space, as in RVGs.
In this paper we explore this third possibility in the VFSA framework that also encompasses what we take to be the crucial aspects of the first and the second kinds of vectorization. One way to specify an n-ary regular relation is by an n-way finite state transducer (FST) defined by a finite set S of states, some designated as initial/final, and a finite list L of arcs, where each arc carries an n-tuple of symbols. Other than providing convenient labels for beginning and endpoints of arcs, in practice states play so little role that …
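The n-way FST described above (a set of states, some initial or final, plus a list of arcs each carrying an n-tuple of symbols) can be illustrated with a minimal Python sketch. This is not the paper's VFSA engine; the names `Arc` and `accepts`, and the convention that an empty string on a tape means "consume nothing on that tape", are assumptions introduced here for illustration only.

```python
from typing import NamedTuple, Tuple

class Arc(NamedTuple):
    source: str
    labels: Tuple[str, ...]  # one symbol per tape; "" (assumed convention) reads nothing
    target: str

def accepts(arcs, initial, final, tapes):
    """Return True if some path from an initial state to a final state
    consumes every tape completely, following the n-tuple labels on arcs."""
    n = len(tapes)
    start = [(s, (0,) * n) for s in initial]
    seen = set(start)          # visited (state, tape positions) configurations
    stack = list(start)
    while stack:
        state, pos = stack.pop()
        if state in final and all(p == len(t) for p, t in zip(pos, tapes)):
            return True
        for arc in arcs:
            if arc.source != state:
                continue
            new_pos = []
            for sym, p, tape in zip(arc.labels, pos, tapes):
                if sym == "":
                    new_pos.append(p)            # this tape does not advance
                elif p < len(tape) and tape[p] == sym:
                    new_pos.append(p + 1)        # symbol matches, advance
                else:
                    break                        # mismatch: arc not applicable
            else:
                cfg = (arc.target, tuple(new_pos))
                if cfg not in seen:
                    seen.add(cfg)
                    stack.append(cfg)
    return False

# Hypothetical 2-way example: pairs (a^k, b^k), optionally followed by "c" on tape 1.
ARCS = [
    Arc("q0", ("a", "b"), "q0"),
    Arc("q0", ("c", ""), "q1"),
]
INITIAL = {"q0"}
FINAL = {"q0", "q1"}
```

In this sketch the states serve only as labels for the endpoints of arcs; all of the relational content sits in the n-tuples on the arcs.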
[1] Martin Kay, et al. Regular Models of Phonological Rule Systems, 1994, CL.
[2] Lisa F. Rau, et al. SCISOR: extracting information from on-line news, 1990, CACM.
[3] András Kornai, et al. Natural Languages And The Chomsky Hierarchy, 1985, EACL.
[4] Steven Bird, et al. Computational Phonology, 2002, Speech Processing.
[5] Noam Chomsky, et al. The Sound Pattern of English, 1968.