Statistical analysis of the relationship between audio and video speech parameters for Australian English

After decades of research, automatic speech processing has become increasingly viable. Audio-video speech recognition has been shown to improve recognition rates in noise-degraded environments. However, which audio and video speech parameters to choose for an optimal system, and how those parameters are related, remains an open research issue. Here we present a number of statistical analyses that aim to deepen our understanding of such audio-video relationships. In particular, we apply canonical correlation analysis and coinertia analysis, both of which investigate relationships between linear combinations of parameters. The analyses are performed on Australian English as an example.
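As a hedged illustration of the first method named above, the sketch below computes canonical correlations between two blocks of parameters (e.g. an audio feature matrix and a video feature matrix) with plain NumPy. The variable names and the small ridge term `reg` are our own choices for numerical stability, not details taken from the paper; the paper's actual feature sets and software are not reproduced here.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-8):
    """Canonical correlations between the columns of X and Y.

    X, Y: (n_samples, n_features) arrays, e.g. audio parameters in X
    and lip/video parameters in Y, one row per synchronized frame.
    Returns the canonical correlations, largest first.
    """
    # Center each block so covariances are about the mean.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Within- and cross-block covariance matrices (with a tiny ridge
    # so the Cholesky factorizations below are well defined).
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    # Whiten each block via Cholesky factors; the singular values of
    # the whitened cross-covariance are the canonical correlations.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    K = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    return np.linalg.svd(K, compute_uv=False)
```

When one block is close to a linear transform of the other, the leading canonical correlation approaches 1, which is the kind of audio-video linear association the analyses in the paper quantify.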
