Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN

He, Yuxuan; Raake, Alexander; Abeßer, Jakob

doi:10.5281/zenodo.17251589

He, Yuxuan; Raake, Alexander; Abeßer, Jakob (2025): Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN, in: Emmanouil Benetos, Frederic Font, Magdalena Fuentes, et al. (eds.), Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), October 2025, Zenodo, pp. 125–129, doi: 10.5281/zenodo.17251589.

Faculty/Chair:

Computational Humanities

Author:

He, Yuxuan

Raake, Alexander

Abeßer, Jakob

Title of the compilation:

Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), October 2025

Editors:

Benetos, Emmanouil

Font, Frederic

Fuentes, Magdalena

Martin Morato, Irene

Rocamora, Martín

Conference:

10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) ; Barcelona

Publisher Information:

Zenodo

Year of publication:

2025

Pages:

125-129

ISBN:

978-84-09-77652-8

Language:

English

DOI:

10.5281/zenodo.17251589

Abstract:

Acoustic Scene Classification (ASC) is a fundamental task in audio signal processing, aiming to classify the location of an environmental audio recording. Recent advances focus on improving ASC model efficiency, particularly in resource-constrained environments. Convolutional neural networks (CNNs) remain the dominant approach due to their high performance, with recent focus on 1D kernels, such as in the Time-Frequency Separate Network (TF-SepNet), for reducing model complexity and computational cost. However, TF-SepNet performs feature extraction using a fixed receptive field in both time and frequency dimensions, which restricts its ability to capture multiscale contextual patterns. In this study, we investigate the integration of multiscale feature extraction modules into TF-SepNet, with the aim of improving model efficiency by balancing accuracy and complexity. We propose three architectures, TFSepDCD-Net, TFSepSPP-Net, and TFSepASPP-Net, each with two architectural variants based on TF-SepNet: one replaces its max pooling layers, and the other replaces its final convolutional layer. Each architecture has three configurations corresponding to different model sizes (small, medium, and large) to explore the tradeoff between accuracy and model complexity. Our experiments show that incorporating multiscale modules allows smaller models to achieve comparable or even superior accuracy to larger baselines. These findings highlight the potential of multiscale representations for improving the efficiency of CNN-based ASC systems, especially in 1D separate architectures like TF-SepNet.
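
To illustrate what the abstract means by a fixed versus multiscale receptive field, the following is a minimal sketch of parallel dilated 1D convolutions along a single axis, in the general spirit of an ASPP-style module. It is not the authors' implementation: the function names, the dilation rates, and the fusion of branches by averaging are all assumptions made for illustration (ASPP-style modules commonly concatenate branches and mix them with a pointwise convolution instead).

```python
def receptive_field(kernel_size, dilation):
    """Span (in samples) covered by one 1D kernel with the given dilation."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(x, kernel, dilation=1):
    """Zero-padded, same-length 1D cross-correlation with a dilated kernel.

    Assumes an odd kernel length so the kernel is centred on each sample.
    """
    k = len(kernel)
    assert k % 2 == 1, "sketch assumes an odd kernel length"
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(x):  # samples outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def multiscale_1d(x, kernel, dilations=(1, 2, 4)):
    """Hypothetical multiscale block: parallel dilated branches, averaged.

    Averaging is an illustrative simplification of the fusion step.
    """
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    for d in (1, 2, 4):
        print(d, receptive_field(3, d), dilated_conv1d(impulse, [1, 1, 1], d))
    print(multiscale_1d(impulse, [1, 1, 1], dilations=(1, 2)))
```

With a kernel of size 3, a single branch covers 3 samples at dilation 1 but 9 samples at dilation 4, at the same parameter count, which is why parallel dilated branches can widen context without enlarging the model.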

Keywords:

acoustic scene classification

time-frequency separate network

1D kernels

multiscale feature extraction

Peer Reviewed:

Yes

International Distribution:

Yes

Type:

Conference object

URI:

https://fis.uni-bamberg.de/handle/uniba/111568

Activation date:

November 24, 2025
