A Three-Level Evaluation Protocol for Acoustic Scene Understanding of Large Language Audio Models
Dilip, Harish; Abeßer, Jakob (2026): A Three-Level Evaluation Protocol for Acoustic Scene Understanding of Large Language Audio Models, in: Bamberg: Otto-Friedrich-Universität, pp. 185–189.
Faculty/Chair:
Author:
Publisher Information:
Year of publication:
2026
Pages:
Source/Other editions:
Emmanouil Benetos, Frederic Font, Magdalena Fuentes, et al. (eds.), Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), October 2025, Zenodo, 2025, pp. 185–189, ISBN: 978-84-09-77652-8
Year of first publication:
2025
Language:
English
Abstract:
Reaching a semantic understanding of complex acoustic scenes requires computational models to capture the temporal-spatial sound source composition as well as individual sound events. This is a great challenge for computational models due to the large variety of everyday sound events and the extensive temporal-spectral overlap in real-life acoustic scenes. In this work, we aim to evaluate the acoustic scene understanding capabilities of two large audio-language models (LALMs). As a challenging scenario, we use the USM dataset, which features synthetic urban soundscapes with 2-6 overlapping sound sources per mixture. Our main contribution is a novel three-layer evaluation protocol, which includes four analysis tasks for low-level sound event understanding (sound event tagging), mid-level understanding and reasoning (sound polyphony estimation, sound source loudness ranking), as well as high-level scene understanding (audio captioning). We apply standardized metrics to assess the models' performance on each task. The proposed multi-layer protocol allows for a fine-grained analysis of model behavior across soundscapes of various complexity levels. Our results indicate that, despite their remarkable controllability using textual instructions, the ability of state-of-the-art LALMs to understand acoustic scenes is still limited, as the performance on individual analysis tasks degrades with increasing sound polyphony.
Keywords:
audio captioning
large audio-language models
sound event tagging
sound polyphony estimation
sound source loudness ranking
Peer Reviewed:
Yes
International Distribution:
Yes
Type:
Conference object
Activation date:
February 2, 2026
Permalink
https://fis.uni-bamberg.de/handle/uniba/112887