Audio-visual speech perception in infancy - A cross-linguistic comparison of multisensory perceptual narrowing and face-scanning behavior in languages belonging to the same rhythm class (German and Swedish)





Professorship/Faculty: Fakultät Humanwissenschaften (Faculty of Human Sciences), Theses; University of Bamberg
Author(s): Dorn, Katharina
Corporate Body: Otto-Friedrich-Universität Bamberg, Lehrstuhl für Entwicklungspsychologie (Chair of Developmental Psychology)
Publisher Information: Bamberg : Otto-Friedrich-Universität
Year of publication: 2021
Pages / Size: X, 236 pages: illustrations, diagrams
Supervisor(s): Weinert, Sabine  ; Carbon, Claus-Christian  
Year of first publication: 2020
Language(s): English
Remark: 
Cumulative dissertation, Otto-Friedrich-Universität Bamberg, 2020
DOI: 10.20378/irb-49306
Licence: Creative Commons - CC BY-NC - Attribution - NonCommercial 4.0 International 
URN: urn:nbn:de:bvb:473-irb-493065
Abstract: 
The importance of considering speech perception and language acquisition as a multimodal, that is, audio-visual, phenomenon can hardly be ignored in light of recent evidence. Research from this perspective has demonstrated that young infants are sensitive to the match between auditory (i.e. syllables, vowels and utterances) and visual (i.e. mouth movements) information in native and non-native speech, even when the two are presented sequentially. Over time, as they gain more experience, infants’ perception and processing of native language attributes improves, while their sensitivity to non-native attributes seems to decline (perceptual narrowing). Empirical findings on perceptual narrowing are ambiguous with regard to the onset and the extent of this tuning phenomenon, but there is evidence that factors such as the richness and presentation of the stimuli play a crucial role. Recently, there has been renewed interest in face-scanning behavior, mainly because eye-tracking devices have made more objective and precise analyses of infants’ gaze patterns possible. Face-scanning behavior is directly associated with audio-visual speech processing, and both have an impact on infants’ later expressive language development. However, no previous study has examined the distance between the native and the non-native language in the context of audio-visual speech processing: previous studies have exclusively considered more distant languages belonging to different rhythm classes, not closer languages belonging to the same rhythm class. Languages that hardly differ in global rhythmic-prosodic cues, but do differ in more specific phonological and phonetic attributes, might nevertheless affect audio-visual matching and face-scanning behavior in early infancy. 
This influence might provide insights into how fine-grained these perception and processing mechanisms are during infancy, when they narrow in the direction of the infant’s native language, and which facial areas infants draw on at different time points during infancy to obtain enough (redundant) cues to acquire their native language(s). Furthermore, no previous study has combined a longitudinal perspective on infants with a cross-linguistic view of languages in order to reduce inter-individual differences across age groups and to generalize the emergence of perceptual narrowing as a cross-linguistic phenomenon. Hence, the present synopsis comprises three studies that address these perspectives on infants’ early audio-visual speech perception of languages belonging to the same rhythm class by investigating early audio-visual matching sensitivities (Study 1), the occurrence of perceptual narrowing (Study 2), and face-scanning behavior during the first year of life and its impact on the infants’ later expressive vocabulary (Study 3). The synopsis summarizes the current state of the (empirical) literature on speech perception, language discrimination and face-scanning behavior, identifies important research gaps, points out relevant research questions, presents the design(s) and main results of the three empirical studies, and finally discusses the findings and their possible implications for future research and practice. The studies are based on self-collected data from the Bamberg Baby Institute at the University of Bamberg (Germany) and the Uppsala Child and Baby Lab at Uppsala University (Sweden). Whereas the first and second studies were based on a cross-linguistic dataset of German and Swedish infants, the third study’s dataset consisted only of German infants, who were followed further longitudinally.
Study 1 addressed the research gap of whether infants make use not only of global rhythmic-prosodic cues (suprasegmental attributes) but also of more subtle language properties, e.g. phonological and phonetic cues (segmental attributes) and additional, slightly distinctive rhythmic-prosodic cues, to discriminate between and audio-visually match languages belonging to the same rhythm class. The study demonstrated for the first time that infants as young as 4.5 months of age are able to extract subtle language properties from two languages belonging to the same rhythm class (German and Swedish) and to sequentially match fluent speech they have heard and seen, even in the absence of temporal synchrony, idiosyncratic aspects and global rhythmic-prosodic cues (suprasegmental attributes). Despite the infants’ sparse linguistic knowledge, this finding confirms the remarkably early emergence of infants’ ability to extract relevant audio-visual speech information and to retain it in short-term memory, thus going beyond purely perceptual, here-and-now processing.
Study 2 built on the first study by addressing the research question of whether the same infants exhibit responses indicative of perceptual narrowing towards their native language at around 6 months of age, even when presented with two languages belonging to the same rhythm class. The study provided evidence that, with sequentially presented rich audio-visual speech utterances, the perception of the same infants, now tested at 6 months of age, had narrowed in the direction of their native language (either German or Swedish). These changes in sensitivity became manifest in significantly different gaze durations to their native language after the infants had listened to it. The German infants exhibited the expected familiarity effect, looking significantly longer at their native language after listening to it, whereas the Swedish infants exhibited an unexpected novelty effect, looking significantly shorter at their native language after listening to it. This discrepancy might result from the Swedish 6-month-olds’ greater attentional focus on the German visual speech, already present during baseline (i.e. specific acoustic characteristics that particularly attracted their attention), or from the different linguistic backgrounds of the two infant samples (infants growing up in Sweden often hear more than one language even if their parents are native Swedish speakers). Nevertheless, any divergence from random looking behavior indicates that the infants were able to discriminate between the presented stimuli. Thus, these two studies indicate the necessity of taking language distance into account in future studies.
Study 3 added more detailed analyses of the infants’ gaze patterns in the context of face-scanning behavior by addressing the research question of how infants scan facial regions (i.e. eyes or mouth) of an articulating face during the first year of life when presented with rhythmically similar languages, and how their face-scanning behavior is associated with expressive language outcomes in the second year of life. This study demonstrated that, even with languages belonging to the same rhythm class, the first attentional shift towards the mouth occurred at 8 months of age, independent of the presented language. The presented language seemed to have an influence from 12 months of age onward: only after listening to their native language did the infants begin to shift their gaze back to the eyes (second attentional shift), whereas after listening to a non-native language their looking behavior remained at chance level. This last finding differs from previous studies using languages belonging to different rhythm classes, in which infants preferred the mouth after listening to a more distant non-native language. Furthermore, and to be interpreted with caution, only gaze behavior at 12 months of age showed a marginal association with the infants’ expressive vocabulary at 18 months of age: the more 12-month-old infants looked at the mouth, the more words they were able to produce at 18 months of age. Taken together, the three studies making up the present synopsis provide additional empirical evidence in the complex research area of audio-visual speech perception. That the results resemble previous findings, even though languages belonging to the same rhythm class were used, indicates that infants’ ability to audio-visually match speech and to benefit from scanning certain facial regions is attributable not only to suprasegmental cues but also to segmental cues. 
In other words, infants are sensitive to more fine-grained speech attributes (e.g. phonetic, phonological and slightly distinctive rhythmic-prosodic cues) in languages belonging to the same rhythm class than has previously been shown. For this reason, it is important for future studies to consider language distance as a supplementary variable when analyzing infants’ speech processing. The finding that infants at 4.5 months of age were able to audio-visually match both their native and a non-native language, but from around 6 months of age became more attuned to the attributes of their native language (perceptual narrowing), stresses the importance of early interventions for deaf and hearing-impaired infants (e.g. providing cochlear implants at an early age, within this apparently sensitive developmental period).
SWD Keywords: Kleinkind (young child); Spracherwerb (language acquisition); Sprachverstehen (speech comprehension); Sprachwahrnehmung (speech perception)
Keywords: speech perception, audio-visual matching, (multisensory) perceptual narrowing, phonological/phonetic cues, cross-cultural study, eye-tracking, face-scanning behavior
DDC Classification: 140 
RVK Classification: CQ 4000   
Document Type: Doctoral thesis
URI: https://fis.uni-bamberg.de/handle/uniba/49306
Release Date: 23 February 2021

File: fisba49306_A3a.pdf (1.58 MB, Adobe PDF)