Insights into machine lip reading

Computer lip-reading is one of the great signal-processing challenges: the signal is not only noisy but highly variable. Yet comparisons of machine performance with human lip-readers are almost unknown, partly because of the paucity of skilled human lip-readers and partly because most automatic systems handle only trivial data that are not representative of natural speech. Here we generate a multiview dataset of connected words that can be analysed both by an automatic system, based on linear-predictive trackers and active appearance models, and by human lip-readers. The automatic system we devise achieves a viseme accuracy of ≈ 46%, which is comparable to that of poor professional human lip-readers. Unlike human lip-readers, however, our system is good at estimating its own fallibility.
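The ≈ 46% figure is a recognition accuracy over viseme sequences. As a minimal sketch (not the authors' code), the following assumes the usual HTK-style definition Acc = (N − D − S − I)/N, where the recognised sequence is Levenshtein-aligned against the reference and N, D, S, I are the reference length and the deletion, substitution, and insertion counts; the viseme labels in the example are hypothetical.

```python
def viseme_accuracy(reference, recognised):
    """HTK-style accuracy (N - D - S - I) / N after Levenshtein alignment."""
    n, m = len(reference), len(recognised)
    # dp[i][j] = minimum edit cost aligning reference[:i] with recognised[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != recognised[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrack through the alignment to count deletions, substitutions, insertions.
    d = s = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        diag = dp[i - 1][j - 1] + (reference[i - 1] != recognised[j - 1]) if i and j else None
        if i > 0 and j > 0 and dp[i][j] == diag:
            if reference[i - 1] != recognised[j - 1]:
                s += 1                      # substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1                          # deletion (reference viseme missed)
            i -= 1
        else:
            ins += 1                        # insertion (spurious viseme output)
            j -= 1
    return (n - d - s - ins) / n

# Toy example with hypothetical viseme labels:
ref = ["V1", "V3", "V2", "V5", "V4"]
rec = ["V1", "V2", "V2", "V5"]              # one substitution, one deletion
print(f"accuracy = {viseme_accuracy(ref, rec):.2f}")   # accuracy = 0.60
```

Note that, because insertions are subtracted, this accuracy can be negative for very poor recognisers, which is why it is reported rather than the plain percent-correct.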
