Confusion modelling for automated lip-reading using weighted finite-state transducers

Automated lip-reading involves recognising speech from only the visual signal. The accuracy of current state-of-the-art lip-reading systems is significantly lower than that obtained by acoustic speech recognisers. These poor results are most likely due to the lack of information about speech production available in the visual signal: for example, it is impossible to discriminate between voiced and unvoiced sounds, or between many places of articulation, from the visual signal alone. Our approach to this problem is to regard the visual speech signal as having been produced by a speaker with a reduced phonemic repertoire, and to attempt to compensate for this. In this respect, visual speech is similar to dysarthric speech, which is produced by a speaker who has poor control over their articulators, leading them to speak with a reduced and distorted set of phonemes. In previous work, we found that the use of weighted finite-state transducers considerably improved recognition performance on dysarthric speech. In this paper, we report the results of applying this technique to lip-reading. The technique works, but our initial results are not as good as those obtained with a conventional approach, and we discuss why this might be so and what the prospects for future investigation are.
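
As a rough illustration of the idea only (not the paper's actual implementation), the sketch below builds a toy confusion model in Python that maps ambiguous visual classes (viseme-like units) to the phonemes they could represent, weighted by assumed confusion probabilities, and scores words from a small lexicon against an observed visual sequence, mimicking the composition of a confusion transducer with a lexicon. The viseme classes, phoneme sets, probabilities, and lexicon are all illustrative assumptions.

```python
import math

# Toy confusion model: each viseme-like visual class maps to the phonemes it
# could represent, with assumed (illustrative) confusion probabilities.  In a
# WFST system this would be a weighted transducer estimated from data.
CONFUSION = {
    "BILABIAL": {"p": 0.4, "b": 0.4, "m": 0.2},   # p/b/m look alike on the lips
    "LAB_DENT": {"f": 0.5, "v": 0.5},             # f/v are visually identical
    "OPEN_VOW": {"ae": 0.6, "aa": 0.4},
    "STOP_ALV": {"t": 0.4, "d": 0.4, "n": 0.2},
}

# Toy lexicon: word -> phoneme sequence (assumed pronunciations).
LEXICON = {
    "bat": ["b", "ae", "t"],
    "pat": ["p", "ae", "t"],
    "mad": ["m", "ae", "d"],
    "fat": ["f", "ae", "t"],
    "vat": ["v", "ae", "t"],
}


def word_cost(visemes, phonemes):
    """Negative log-probability of a phoneme sequence given the viseme sequence,
    i.e. the weight of the corresponding path through confusion o lexicon."""
    if len(visemes) != len(phonemes):
        return math.inf
    cost = 0.0
    for v, p in zip(visemes, phonemes):
        prob = CONFUSION.get(v, {}).get(p, 0.0)
        if prob == 0.0:
            return math.inf          # phoneme not reachable from this viseme
        cost += -math.log(prob)
    return cost


def decode(visemes):
    """Rank lexicon words by how well they explain the observed viseme sequence."""
    scored = [(word_cost(visemes, phones), word) for word, phones in LEXICON.items()]
    return sorted((c, w) for c, w in scored if c < math.inf)


if __name__ == "__main__":
    observed = ["BILABIAL", "OPEN_VOW", "STOP_ALV"]
    for cost, word in decode(observed):
        print(f"{word}: cost {cost:.3f}")
```

In a full WFST pipeline the same effect would be obtained by composing a confusion transducer with lexicon and language-model transducers and taking the shortest path, rather than by the brute-force scoring shown here; the sketch only conveys how a reduced visual repertoire can be expanded back into competing phonemic hypotheses.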