Hypothesis testing for evaluating a multimodal pattern recognition framework applied to speaker detection

Background: Speaker detection is an important component of many human-computer interaction applications, such as multimedia indexing and ambient intelligence systems. This work addresses the problem of detecting the current speaker in audio-visual sequences. The detector requires only simple equipment, since a single camera and a single microphone are sufficient.

Method: A multimodal pattern recognition framework is proposed, with solutions provided for each step of the process, namely the feature generation and extraction steps, the classification, and the evaluation of the system performance. The decision is based on an estimate of the synchrony between the audio and video signals. Prior to classification, an information-theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework in order to obtain confidence levels associated with the classifier outputs, thereby allowing an evaluation of the performance of the whole multimodal pattern recognition system.

Results: Through the hypothesis testing approach, the classifier performance can be given as a ratio of detection to false-alarm probabilities. Above all, the hypothesis tests provide a means of measuring the efficiency of the whole pattern recognition process. In particular, the gain offered by the proposed feature extraction step can be evaluated. It is shown that introducing such a feature extraction step increases the ability of the classifier to produce good relative instance scores, and therefore improves the performance of the pattern recognition process.

Conclusion: The power of hypothesis tests as an evaluation tool is exploited to assess the performance of a multimodal pattern recognition process. In particular, the benefit of performing a feature extraction step prior to classification is evaluated. Although the proposed framework is used here to detect the speaker in audio-visual sequences, it could be applied to any other classification task involving two co-occurring spatio-temporal signals.
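To make the classification step concrete, the following is a minimal sketch of a binary hypothesis test on audio-visual synchrony; the likelihood-ratio form and the notation used here are illustrative assumptions, not details taken from the paper. The two hypotheses contrast an inactive and an active speaker:

\[
H_0:\ \text{the audio and video observations } (a, v) \text{ are statistically independent (speaker inactive)}, \qquad
H_1:\ (a, v) \text{ are dependent (speaker active)}.
\]

A Neyman-Pearson style rule compares a test statistic, for example a likelihood ratio, to a threshold \(\eta\):

\[
\Lambda(a, v) \;=\; \frac{p(a, v \mid H_1)}{p(a, v \mid H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta ,
\]

and performance is summarized by the detection and false-alarm probabilities \(P_D = \Pr(\Lambda > \eta \mid H_1)\) and \(P_{FA} = \Pr(\Lambda > \eta \mid H_0)\). Sweeping \(\eta\) traces the ROC curve, which is how the ratio of detection to false-alarm probabilities mentioned in the Results can be read off for any operating point.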
