Speech Pattern Discovery using Audio-Visual Fusion and Canonical Correlation Analysis

In this paper, we address the problem of automatically discovering speech patterns using audio-visual information fusion. Unlike previous studies that rely on the audio modality alone, our work uses not only acoustic information but also visual features extracted from the mouth region. To make more effective use of multimodal information, we employ several audio-visual fusion strategies: feature concatenation, similarity weighting, and decision fusion. In particular, our decision fusion approach retains the reliable patterns discovered in each of the audio and visual modalities. Moreover, we use canonical correlation analysis (CCA) to address the temporal asynchrony between the audio and visual speech modalities, and we adopt unbounded dynamic time warping (UDTW) to search for speech patterns in audio and visual similarity matrices computed on the aligned sequences. Experiments on an audio-visual corpus show, for the first time, that speech pattern discovery can be improved by the use of visual information. The decision fusion approach outperforms standard feature concatenation and similarity weighting, and CCA-based audio-visual synchronization plays an important role in the performance improvement.
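To illustrate the CCA step the abstract refers to, the sketch below computes canonical correlations between an audio and a visual feature stream with plain NumPy (whitened cross-covariance followed by an SVD). This is only a minimal illustration, not the paper's pipeline: the random feature matrices, their dimensions (13-dim "acoustic" and 20-dim "mouth-region" vectors), and the ridge term are all assumptions made for the example.

```python
import numpy as np

def cca(X, Y, n_components):
    """Canonical correlation analysis via whitening + SVD.

    Returns projected X scores, projected Y scores, and the
    canonical correlations (nonincreasing, in [0, 1]).
    """
    # Center both views.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance blocks, with a small ridge for numerical stability.
    Sxx = Xc.T @ Xc / (n - 1) + 1e-8 * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + 1e-8 * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view (Cholesky inverse) and take the SVD of the
    # whitened cross-covariance; its singular values are the
    # canonical correlations.
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky.T)
    A = Kx.T @ U[:, :n_components]      # projection for X
    B = Ky.T @ Vt[:n_components].T      # projection for Y
    return Xc @ A, Yc @ B, s[:n_components]

# Toy stand-ins for per-frame audio and visual features
# (dimensions are illustrative assumptions).
rng = np.random.default_rng(0)
T = 200
audio = rng.standard_normal((T, 13))   # e.g. MFCC-like vectors
visual = rng.standard_normal((T, 20))  # e.g. mouth-region descriptors

audio_c, visual_c, corrs = cca(audio, visual, n_components=5)
```

In a synchronization setting, the projected streams `audio_c` and `visual_c` live in a shared correlated space, so frame-level similarity (and hence DTW-style alignment) can be computed across modalities.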
