A Three-Level Evaluation Protocol for Acoustic Scene Understanding of Large Language Audio Models

Dilip, Harish; Abeßer, Jakob

Faculty/Chair:

Computational Humanities

Author:

Dilip, Harish; Abeßer, Jakob

Publisher Information:

Bamberg : Otto-Friedrich-Universität

Year of publication:

2026

Pages:

185-189

Source/Other editions:

Emmanouil Benetos, Frederic Font, Magdalena Fuentes, et al. (eds.), Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), October 2025, Zenodo, 2025, pp. 185–189, ISBN: 978-84-09-77652-8

Year of first publication:

2025

Language:

English

Licence:

Creative Commons - CC BY - Attribution 4.0 International

URN:

urn:nbn:de:bvb:473-irb-112887x

Abstract:

Reaching a semantic understanding of complex acoustic scenes requires computational models to capture the temporal-spatial sound source composition as well as individual sound events. This is a great challenge for computational models due to the large variety of everyday sound events and the extensive temporal-spectral overlap in real-life acoustic scenes. In this work, we aim to evaluate the acoustic scene understanding capabilities of two large audio-language models (LALMs). As a challenging scenario, we use the USM dataset, which features synthetic urban soundscapes with 2-6 overlapping sound sources per mixture. Our main contribution is a novel three-layer evaluation protocol, which includes four analysis tasks for low-level sound event understanding (sound event tagging), mid-level understanding and reasoning (sound polyphony estimation, sound source loudness ranking), as well as high-level scene understanding (audio captioning). We apply standardized metrics to assess the models' performance on each task. The proposed multi-layer protocol allows for a fine-grained analysis of model behavior across soundscapes of various complexity levels. Our results indicate that, despite their remarkable controllability using textual instructions, the ability of state-of-the-art LALMs to understand acoustic scenes is still limited, as the performance on individual analysis tasks degrades with increasing sound polyphony.
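The abstract does not name the metrics used for each task, so the following is only a minimal sketch of how the low-level and mid-level tasks could be scored and then broken down by polyphony. The metric choices (set-based F1 for tagging, absolute count error for polyphony estimation, Kendall's tau for loudness ranking) and all function names are assumptions made for illustration and are not taken from the paper. Audio captioning is left out because its metrics require external tooling.

```python
from collections import defaultdict
from itertools import combinations


def tagging_f1(reference_tags, predicted_tags):
    """Set-based F1 between reference and predicted sound event tags (assumed metric)."""
    ref, pred = set(reference_tags), set(predicted_tags)
    true_positives = len(ref & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def polyphony_abs_error(reference_count, predicted_count):
    """Absolute error of the estimated number of sound sources (assumed metric)."""
    return abs(reference_count - predicted_count)


def kendall_tau(reference_order, predicted_order):
    """Kendall's tau-a between two loudness rankings of the same items, no ties (assumed metric).

    Both lists must contain the same labels, ordered from loudest to quietest.
    """
    position = {label: i for i, label in enumerate(predicted_order)}
    concordant = discordant = 0
    for first, second in combinations(reference_order, 2):
        # 'first' precedes 'second' in the reference ranking.
        if position[first] < position[second]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = concordant + discordant
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


def mean_by_polyphony(records):
    """Average a per-mixture score for each polyphony level.

    records: iterable of (polyphony, score) pairs.
    """
    buckets = defaultdict(list)
    for polyphony, score in records:
        buckets[polyphony].append(score)
    return {p: sum(s) / len(s) for p, s in sorted(buckets.items())}
```

Grouping scores with mean_by_polyphony is one way to inspect how a task metric changes as the number of overlapping sources grows from 2 to 6, which is the kind of breakdown the abstract describes.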

Keywords:

audio captioning; large audio-language models; sound event tagging; sound polyphony estimation; sound source loudness ranking

Peer Reviewed:

Yes

International Distribution:

Yes

Type:

Conference object

URI:

https://fis.uni-bamberg.de/handle/uniba/112887

Activation date:

February 2, 2026

Permalink https://fis.uni-bamberg.de/handle/uniba/112887
