The DIRAC AWEAR audio-visual platform for detection of unexpected and incongruent events

It is of prime importance in everyday human life to cope with and respond appropriately to events that are not foreseen by prior experience. Machines largely lack the ability to respond appropriately to such inputs. An important class of unexpected events is defined by incongruent combinations of inputs from different modalities; multimodal information therefore provides a crucial cue for identifying such events, e.g., a voice is heard while the person in the field of view does not move her lips. In the project DIRAC ("Detection and Identification of Rare Audio-visual Cues") we have been developing algorithmic approaches to the detection of such events, as well as an experimental hardware platform to test them. An audio-visual platform ("AWEAR" - audio-visual wearable device) has been constructed with the goal of helping users with disabilities or under high cognitive load to deal with unexpected events. Key hardware components include stereo panoramic vision sensors and 6-channel worn-behind-the-ear (hearing-aid) microphone arrays. Data have been recorded to study audio-visual tracking, a/v scene/object classification, and a/v detection of incongruencies.
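
The following minimal sketch illustrates the incongruence-detection idea from the voice/lips example above: audio and visual speech evidence are compared frame by frame, and an event is flagged as incongruent when a voice is detected while the visible lips remain static. The detector outputs, thresholds, and names here are hypothetical placeholders for illustration, not the DIRAC implementation.

```python
from dataclasses import dataclass

@dataclass
class FrameEvidence:
    """Per-frame detector outputs (hypothetical scores in [0, 1])."""
    audio_speech_prob: float  # e.g., from a voice-activity detector on the mic array
    lip_motion_prob: float    # e.g., from visual lip-motion analysis on a tracked face

def is_incongruent(frame: FrameEvidence,
                   audio_thresh: float = 0.8,
                   lip_thresh: float = 0.2) -> bool:
    """Flag an audio-visual incongruence: a voice is heard while the
    visible speaker's lips do not move. Thresholds are illustrative."""
    return (frame.audio_speech_prob >= audio_thresh
            and frame.lip_motion_prob <= lip_thresh)

if __name__ == "__main__":
    frames = [
        FrameEvidence(audio_speech_prob=0.95, lip_motion_prob=0.05),  # incongruent
        FrameEvidence(audio_speech_prob=0.90, lip_motion_prob=0.70),  # congruent speech
        FrameEvidence(audio_speech_prob=0.10, lip_motion_prob=0.05),  # silence, congruent
    ]
    for i, f in enumerate(frames):
        print(f"frame {i}: incongruent = {is_incongruent(f)}")
```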
