Regularized models of audiovisual integration of speech with predictive power for sparse behavioral data

Abstract: Audiovisual integration can facilitate speech comprehension by integrating information from lip-reading with auditory speech perception. When incongruent acoustic speech is dubbed onto a video of a talking face, this integration can lead to the McGurk illusion of hearing a different phoneme than that spoken by the voice. Several computational models of the information integration process underlying these phenomena exist. All are based on the assumption that the integration process is, in some sense, optimal. They differ, however, in assuming that it is based on either continuous or categorical internal representations. Here we develop models of audiovisual integration in which the phonetic information is represented on a continuous, cyclical internal representation. We compare these models to the Fuzzy Logical Model of Perception (FLMP), which is based on a categorical internal representation. Using cross-validation, we show that model evaluation criteria based on goodness-of-fit are poor measures of the models' generalization error, even when they take the number of free parameters into account. We also show that the predictive power of all the models benefits from regularization that limits the precision of the internal representation. Finally, we show that, unlike the FLMP, models based on a continuous internal representation have good predictive power when properly regularized.
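
For orientation, below is a minimal sketch of the two model families the abstract contrasts, assuming the standard formulations from the literature (the multiplicative FLMP decision rule and maximum-likelihood fusion of von Mises likelihoods on the circle) rather than the exact parameterizations fitted in the paper:

```python
import numpy as np

def flmp_response_probs(a, v):
    """FLMP-style categorical integration (standard textbook form, not
    necessarily the exact parameterization used in this paper): per-category
    auditory and visual support values are multiplied and renormalized
    over response categories."""
    a, v = np.asarray(a, float), np.asarray(v, float)
    s = a * v
    return s / s.sum()

def von_mises_fusion(mu_a, kappa_a, mu_v, kappa_v):
    """Optimal fusion of two von Mises likelihoods on the circle, i.e. a
    continuous, cyclical internal representation. The product of two von
    Mises densities is again von Mises, with parameters given by summing
    the concentration-weighted unit vectors of the two cues."""
    x = kappa_a * np.cos(mu_a) + kappa_v * np.cos(mu_v)
    y = kappa_a * np.sin(mu_a) + kappa_v * np.sin(mu_v)
    mu_av = np.arctan2(y, x)      # fused mean direction
    kappa_av = np.hypot(x, y)     # fused concentration (precision)
    return mu_av, kappa_av

# Illustrative incongruent (McGurk-type) audio and visual evidence
print(flmp_response_probs(a=[0.8, 0.15, 0.05], v=[0.1, 0.2, 0.7]))
print(von_mises_fusion(mu_a=0.0, kappa_a=4.0, mu_v=1.0, kappa_v=8.0))
```

In the von Mises formulation the fused concentration is largest when the auditory and visual means agree and decreases as they conflict, which is one way a continuous, cyclical representation naturally accommodates incongruent stimuli; limiting this precision is the kind of regularization the abstract refers to, though the paper's own parameterization may differ.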
