Audiovisual synchrony detection for speech and music signals
Hwee-Ling Lee, Uta Noppeney
Poster
Time: 2009-06-30 09:00 AM – 10:30 AM
Last modified: 2009-06-04
Abstract
Introduction: Audiovisual integration crucially depends on the relative timing of the auditory and visual signals. Although multisensory signals need not be precisely physically synchronous to be perceived as a single temporal event, they have to co-occur within a certain temporal window of integration. To investigate how the human brain is fine-tuned to the natural temporal statistics of audiovisual signals, we characterized the temporal integration window for natural speech, sine-wave replicas of natural speech (SWS) and music in a simultaneity judgment task.
Methods: The experimental paradigm manipulated: 1) stimulus class: speech vs. SWS vs. music, and 2) stimulus length: short (i.e. natural syllables, SWS syllables and tones) vs. long (i.e. natural sentences, SWS sentences and melodies). Audiovisual asynchronies ranged from -360 ms (auditory leading) to +360 ms (visual leading) in 60 ms increments. Eight participants performed the experiment on two separate days. The order of conditions was counterbalanced within and between subjects. The proportion of synchronous responses was computed for each participant. To avoid distributional assumptions, each participant's psychometric curve was characterized by four indices: (i) peak performance, (ii) peak location, (iii) width and (iv) asymmetry [1]. The four indices were analyzed using repeated-measures ANOVAs with stimulus class and stimulus length as within-subjects factors.
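The four indices are defined non-parametrically in [1]. Purely as an illustration of how such indices can be read directly off a raw psychometric curve, the Python sketch below computes plausible versions of them; the half-maximum criterion for width and the area-imbalance measure of asymmetry are our assumptions here, not necessarily the definitions used in [1].

```python
import numpy as np

# SOA levels used in the experiment: -360 ms (auditory leading) to
# +360 ms (visual leading) in 60 ms steps.
soas = np.arange(-360, 361, 60)

def curve_indices(p_sync):
    """Characterize one participant's psychometric curve without
    distributional assumptions. p_sync is a NumPy array holding the
    proportion of 'synchronous' responses at each SOA. The exact
    index definitions are given in [1]; the half-maximum width and
    area-imbalance asymmetry below are illustrative stand-ins."""
    peak_performance = p_sync.max()
    peak_location = soas[p_sync.argmax()]

    # Width: span of SOAs where responses exceed half of the peak
    # (assumed criterion, not necessarily the one used in [1]).
    above = soas[p_sync >= peak_performance / 2]
    width = above.max() - above.min()

    # Asymmetry: imbalance of the area under the curve to the right
    # vs. left of the peak (assumed definition).
    left = np.trapz(p_sync[soas <= peak_location], soas[soas <= peak_location])
    right = np.trapz(p_sync[soas >= peak_location], soas[soas >= peak_location])
    asymmetry = (right - left) / (right + left)

    return peak_performance, peak_location, width, asymmetry
```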
Results: The ANOVA for peak performance showed no significant main effects of stimulus class or length [F(2,14)<1, n.s.; F(1,7)=1.6, p=.24]. The ANOVA for peak location revealed a significant interaction between stimulus class and length [F(2,14)=3.8, p<.05]. Post-hoc paired t-tests showed that peak locations were significantly shifted towards auditory leading for melodies compared to tones [t(7)=2.4, p<.05], and marginally so for melodies compared to SWS sentences [t(7)=-2.3, p=.053]. The ANOVA for width revealed significant main effects of stimulus class and length [F(2,14)=9.3, p<.005; F(1,7)=11.0, p<.05] in the absence of an interaction [F(2,14)<1, n.s.]. Post-hoc paired t-tests showed that the windows were wider for SWS speech than for natural speech [t(7)=7.0, p<.005] and music [t(7)=2.4, p=.05]. Furthermore, the windows were narrower for long stimuli (i.e. sentences and melodies) than for short stimuli (i.e. syllables and tones) [t(7)=-3.3, p<.05]. For asymmetry, there was a significant main effect of stimulus length [F(1,7)=7.1, p<.05] but not of stimulus class [F(2,14)=1.1, p=.35], indicating that the psychometric curves were more asymmetric for long stimuli (i.e. sentences and melodies) than for short stimuli (i.e. syllables and tones).
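To make the analysis pipeline concrete, the following minimal Python sketch runs the same 3 x 2 repeated-measures ANOVA (here on the width index) with statsmodels' AnovaRM; the long-format table, column names and values are hypothetical placeholders rather than the study's data.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)

# Hypothetical long-format table: one row per participant x condition
# cell, holding one curve index (here 'width'). Values are random
# placeholders, not the study's data.
rows = [
    {"subject": s, "stim_class": c, "stim_length": l,
     "width": rng.normal(300.0, 40.0)}
    for s in range(8)                      # eight participants
    for c in ("speech", "SWS", "music")    # stimulus class (3 levels)
    for l in ("short", "long")             # stimulus length (2 levels)
]
df = pd.DataFrame(rows)

# Repeated-measures ANOVA with both factors within subjects; with
# 8 subjects this yields the F(2,14) and F(1,7) tests reported above.
print(AnovaRM(df, depvar="width", subject="subject",
              within=["stim_class", "stim_length"]).fit())
```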
Conclusion: Our results demonstrate that the psychometric curves were narrower and more asymmetric for long stimuli (i.e. sentences and melodies) than for short stimuli (i.e. syllables and tones); participants may thus rely on information accumulated over the entire sentence when judging synchrony. In addition, the psychometric curves were wider for SWS speech than for natural speech and music. Collectively, our results support the hypothesis that audiovisual speech perception is fine-tuned to the natural mapping between facial movements and the spectrotemporal structure of natural speech.
Reference:
1. Maier J., Di Luca M. and Noppeney U. (2009). Audiovisual asynchrony detection in human speech (submitted).