Introducing statistical dependencies and structural constraints in variable-length sequence models

In the field of natural language processing, as in many other domains, the efficiency of pattern recognition algorithms depends heavily on a proper description of the underlying structure of the data. However, this hidden structure is usually not known, and it has to be learned from examples. The multigram model [1, 2] was originally designed to extract variable-length regularities within streams of symbols, by describing the data as the concatenation of statistically independent sequences. Such a description seems especially appealing in the case of natural language corpora, since natural language syntactic regularities are clearly of variable length: sentences are composed of a variable number of syntagms, which in turn are made of a variable number of words, which contain a variable number of morphemes, and so on. However, previous experiments with this model [3] revealed the inadequacy of the independence assumption in the particular context of a grapheme-to-phoneme transcription task. In this paper, our goal is therefore twofold:
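To make the basic multigram idea concrete, the following sketch illustrates how a symbol stream can be parsed into statistically independent variable-length sequences. The multigram inventory and its probabilities are purely hypothetical, and the dynamic-programming search shown here is only one common way to find the most likely segmentation under the independence assumption; it is not the estimation procedure of [1, 2].

```python
import math

# Hypothetical multigram inventory: variable-length symbol sequences
# with illustrative probabilities. Under the independence assumption,
# P(segmentation) is the product of the probabilities of its sequences.
multigrams = {
    ("a",): 0.3,
    ("b",): 0.2,
    ("a", "b"): 0.4,
    ("b", "a"): 0.1,
}

def best_segmentation(stream, multigrams, max_len=3):
    """Viterbi-style dynamic programming: return the most likely parse
    of `stream` into independent multigram sequences (log domain)."""
    n = len(stream)
    # best[i] = (best log-probability of stream[:i], length of last sequence)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            seq = tuple(stream[i - l:i])
            p = multigrams.get(seq)
            if p is None:
                continue
            score = best[i - l][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, l)
    # Backtrack to recover the segmentation.
    segs, i = [], n
    while i > 0:
        l = best[i][1]
        if l is None:  # no parse covers the stream
            return None, -math.inf
        segs.append(tuple(stream[i - l:i]))
        i -= l
    segs.reverse()
    return segs, best[n][0]

segs, logp = best_segmentation(list("abba"), multigrams)
# segs == [("a", "b"), ("b", "a")], with probability 0.4 * 0.1 = 0.04
```

Because each sequence contributes its probability independently of its neighbours, the score of a parse factorizes completely; it is precisely this factorization that the experiments in [3] found too restrictive.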