Learning audio-visual associations using mutual information

This paper addresses the problem of finding useful associations between audio and visual input signals. The proposed approach is based on maximizing the mutual information of audio-visual clusters. It segments continuous speech signals and finds visual categories that correspond to the segmented spoken words. Such audio-visual associations may be used for modeling infant language acquisition and for dynamically personalizing speech-based human-computer interfaces in applications including catalog browsing and wearable computing. This paper describes an implemented system for learning shape names from camera and microphone input. We present results from an evaluation of the system in the domain of modeling language learning.
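As a rough illustration of the quantity being maximized, the sketch below computes the mutual information between paired discrete cluster labels. It is not the system's implementation: the function name `mutual_information`, the toy labels, and the assumption that audio and visual observations have already been quantized into paired discrete clusters are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(audio_labels, visual_labels):
    """Mutual information (in bits) between two paired label sequences.

    Assumption: each position holds one co-occurring (audio cluster,
    visual cluster) observation. Probabilities are plain relative
    frequencies, with no smoothing.
    """
    if len(audio_labels) != len(visual_labels):
        raise ValueError("label sequences must have the same length")
    n = len(audio_labels)
    joint = Counter(zip(audio_labels, visual_labels))
    audio_counts = Counter(audio_labels)
    visual_counts = Counter(visual_labels)

    mi = 0.0
    for (a, v), count in joint.items():
        p_av = count / n
        # p(a, v) / (p(a) * p(v)) rewritten in terms of raw counts
        ratio = count * n / (audio_counts[a] * visual_counts[v])
        mi += p_av * math.log2(ratio)
    return mi


# Toy data: the word cluster predicts the shape cluster perfectly.
associated = mutual_information(["ball", "ball", "cup", "cup"], [0, 0, 1, 1])
# Toy data: the word cluster carries no information about the shape cluster.
unrelated = mutual_information(["ball", "cup", "ball", "cup"], [0, 0, 1, 1])
```

In this toy setting a consistent word-shape pairing scores higher than an inconsistent one, which is the sense in which mutual information can rank candidate audio-visual associations.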
