Lip-reading aids word recognition most in moderate noise: a Bayesian explanation using high-dimensional feature space

Xiang Zhou, Wei Ji Ma, Lars A Ross, John J Foxe, Lucas C Parra
Poster
Time: 2009-07-02  09:00 AM – 10:30 AM
Last modified: 2009-06-04

Abstract


Watching a speaker’s facial movements can dramatically enhance our ability to comprehend words, especially in noisy
environments. From a general rule for combining information across sensory modalities (the principle of inverse
effectiveness), one would expect visual signals to be most effective at the highest levels of auditory noise. In
contrast, we find, in accord with a recent paper, that visual information improves performance more at intermediate levels
of auditory noise than at the highest levels, and we show that a novel visual stimulus containing only temporal information
does the same. We present a Bayesian model of optimal cue integration that can explain this apparent conflict. In this model,
words are regarded as points in a multidimensional space and word recognition is a probabilistic inference process. When
the dimensionality of the feature space is low, the Bayesian model predicts inverse effectiveness; when the dimensionality is
high, the enhancement is maximal at intermediate auditory noise levels. When the auditory and visual stimuli differ slightly
in high noise, the model makes a counterintuitive prediction: as sound quality increases, the proportion of reported words
corresponding to the visual stimulus should first increase and then decrease. We confirm this prediction in a behavioral
experiment. We conclude that auditory-visual speech perception obeys the same notion of optimality previously observed
only for simple multisensory stimuli.
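
To make the model concrete, the following is a minimal simulation sketch of the inference described above: words are points in a D-dimensional feature space, the auditory and visual inputs are the target word corrupted by independent isotropic Gaussian noise, and an ideal observer reports the word with the highest posterior under a flat prior. The lexicon, noise levels, and visual reliability (n_words, sigma_a_levels, sigma_v) are arbitrary illustrative choices rather than the stimulus parameters or fitted values of the study; the code shows the structure of the model, not its quantitative predictions.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate(dim, n_words=50, n_trials=2000, sigma_v=2.0, sigma_a_levels=None):
        """Proportion correct for auditory-only vs. audiovisual word recognition.

        Words are points in a `dim`-dimensional feature space. On each trial the
        observer receives noisy auditory and visual samples of the target word and
        reports the word with the highest posterior probability (flat prior,
        isotropic Gaussian noise), i.e. the smallest precision-weighted distance.
        """
        if sigma_a_levels is None:
            sigma_a_levels = np.linspace(0.25, 4.0, 16)   # low -> high auditory noise
        words = rng.standard_normal((n_words, dim))        # the lexicon
        acc_a, acc_av = [], []
        for sigma_a in sigma_a_levels:
            targets = rng.integers(n_words, size=n_trials)
            x_a = words[targets] + sigma_a * rng.standard_normal((n_trials, dim))
            x_v = words[targets] + sigma_v * rng.standard_normal((n_trials, dim))
            # Negative log-likelihood of every candidate word, up to constants.
            nll_a = ((x_a[:, None, :] - words[None]) ** 2).sum(-1) / sigma_a ** 2
            nll_v = ((x_v[:, None, :] - words[None]) ** 2).sum(-1) / sigma_v ** 2
            acc_a.append(np.mean(nll_a.argmin(1) == targets))             # A alone
            acc_av.append(np.mean((nll_a + nll_v).argmin(1) == targets))  # A + V combined
        return np.asarray(sigma_a_levels), np.asarray(acc_a), np.asarray(acc_av)

    for dim in (1, 20):
        sigma, acc_a, acc_av = simulate(dim)
        benefit = acc_av - acc_a
        print(f"dim={dim:2d}: audiovisual benefit peaks at sigma_A = "
              f"{sigma[benefit.argmax()]:.2f} (benefit = {benefit.max():.2f})")

Incongruent trials, in which the auditory and visual stimuli differ slightly, can be sketched in the same framework by centering x_a and x_v on two different lexicon points and tallying how often the visually specified word is reported as the auditory noise level varies.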
