Regularization improves models of audiovisual integration in speech perception

Tobias Søren Andersen

Abstract


Visual speech, the speech information conveyed by the sight of articulatory mouth movements, can influence the auditory phonetic speech percept. This is demonstrated by the McGurk illusion, in which an acoustic speech signal (e.g. /ba/) is perceived differently (as /da/) when presented audio-visually, dubbed onto the video of an incongruent talking face (articulating /ga/).

A computational account of the integration of information across the senses underlying the McGurk illusion has long been sought. One account, the Fuzzy Logical Model of Perception (FLMP; Massaro, 1998), posits that integration is based on fuzzy truth values. Here we present an alternative account in which integration is based on continuous feature values.
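
For reference, the FLMP integrates the two modalities by multiplying the degree of support each one lends to a response alternative and normalizing across alternatives. Writing a_{r,i} and v_{r,j} (our notation here) for the fuzzy truth values with which auditory stimulus A_i and visual stimulus V_j support alternative r, the predicted probability of responding r is

P(r \mid A_i, V_j) = \frac{a_{r,i}\, v_{r,j}}{\sum_k a_{k,i}\, v_{k,j}}.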

We show that such models can provide a better fit to observed data than the FLMP. In order to take model flexibility into account, we cross-validate model fits and show that although feature-based models have more predictive power than the FLMP, both types of models perform rather poorly. Finally, we show that the predictive power of both types of models improves when the models are regularized by Bayesian priors and that, after regularization, feature-based models have significantly better predictive power than the FLMP.
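
To illustrate the kind of procedure involved (not the actual models, priors, or data reported here), the sketch below fits a two-alternative FLMP to simulated audiovisual response counts by maximum likelihood and by maximum a posteriori (MAP) estimation with a Gaussian prior on logit-transformed parameters, and compares the fits on held-out data; all names and numbers are illustrative assumptions.

# Minimal sketch of regularized (MAP) model fitting with held-out evaluation.
# Data, parameter values, and the Gaussian prior are illustrative assumptions,
# not the procedure or data of the study itself.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function

rng = np.random.default_rng(0)

# Hypothetical data: counts of /da/ responses out of n_trials for each of
# 3 auditory x 3 visual stimulus combinations, in two independent sessions.
n_trials = 24
true_a = np.array([0.9, 0.5, 0.1])   # auditory support for /da/
true_v = np.array([0.8, 0.5, 0.2])   # visual support for /da/
num = true_a[:, None] * true_v[None, :]
p_true = num / (num + (1 - true_a)[:, None] * (1 - true_v)[None, :])
train = rng.binomial(n_trials, p_true)
test = rng.binomial(n_trials, p_true)

def flmp_prob(theta):
    # Two-alternative FLMP prediction from logit-parameterized supports.
    a, v = expit(theta[:3]), expit(theta[3:])
    num = a[:, None] * v[None, :]
    return num / (num + (1 - a)[:, None] * (1 - v)[None, :])

def neg_log_posterior(theta, counts, prior_sd=None):
    # Binomial negative log-likelihood (up to an additive constant),
    # plus an optional Gaussian prior on the logits (an L2 penalty).
    p = np.clip(flmp_prob(theta), 1e-9, 1 - 1e-9)
    nll = -np.sum(counts * np.log(p) + (n_trials - counts) * np.log(1 - p))
    if prior_sd is not None:
        nll += 0.5 * np.sum(theta ** 2) / prior_sd ** 2
    return nll

for prior_sd in (None, 2.0):  # None = maximum likelihood, 2.0 = MAP
    fit = minimize(neg_log_posterior, np.zeros(6), args=(train, prior_sd))
    held_out = neg_log_posterior(fit.x, test)  # evaluate without the prior
    label = "maximum likelihood" if prior_sd is None else "MAP (regularized)"
    print(f"{label}: held-out negative log-likelihood = {held_out:.1f}")

The Gaussian prior on the logits shrinks the fitted supports toward 0.5, which is one simple way of regularizing such a model; the priors used in the study may of course differ.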

Keywords


audiovisual; speech perception; modeling

References


Massaro, D. W. (1998). Perceiving talking faces. Cambridge, Massachusetts: MIT Press.
