Authors: Dilip, Harish; Abeßer, Jakob (ORCID: 0000-0003-4689-7444)
Date: 2026-02-02
URI: https://fis.uni-bamberg.de/handle/uniba/112887
Abstract: Reaching a semantic understanding of complex acoustic scenes requires computational models to capture the temporal-spatial sound source composition as well as individual sound events. This poses a great challenge due to the large variety of everyday sound events and the extensive temporal-spectral overlap in real-life acoustic scenes. In this work, we aim to evaluate the acoustic scene understanding capabilities of two large audio-language models (LALMs). As a challenging scenario, we use the USM dataset, which features synthetic urban soundscapes with 2-6 overlapping sound sources per mixture. Our main contribution is a novel three-level evaluation protocol, which includes four analysis tasks for low-level sound event understanding (sound event tagging), mid-level understanding and reasoning (sound polyphony estimation, sound source loudness ranking), as well as high-level scene understanding (audio captioning). We apply standardized metrics to assess the models' performance on each task. The proposed multi-level protocol allows for a fine-grained analysis of model behavior across soundscapes of varying complexity. Our results indicate that, despite their remarkable controllability via textual instructions, the ability of state-of-the-art LALMs to understand acoustic scenes is still limited, as performance on the individual analysis tasks degrades with increasing sound polyphony.
Language: English
Keywords: audio captioning; large audio-language models; sound event tagging; sound polyphony estimation; sound source loudness ranking
Title: A Three-Level Evaluation Protocol for Acoustic Scene Understanding of Large Audio-Language Models
Type: conference object
URN: urn:nbn:de:bvb:473-irb-112887x