Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN
He, Yuxuan; Raake, Alexander; Abeßer, Jakob (2025): Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN, in: Emmanouil Benetos, Frederic Font, Magdalena Fuentes, et al. (eds.), Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), October 2025, Zenodo, pp. 125–129, doi: 10.5281/zenodo.17251589.
Author:
He, Yuxuan
Raake, Alexander
Abeßer, Jakob
Title of the compilation:
Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025), October 2025
Editors:
Benetos, Emmanouil
Font, Frederic
Fuentes, Magdalena
Martin Morato, Irene
Rocamora, Martín
Conference:
10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE); Barcelona
Publisher Information:
Zenodo
Year of publication:
2025
Pages:
125–129
ISBN:
978-84-09-77652-8
Language:
English
Abstract:
Acoustic Scene Classification (ASC) is a fundamental task in audio signal processing, aiming to classify the location of an environmental audio recording. Recent advances focus on improving ASC model efficiency, particularly in resource-constrained environments. Convolutional neural networks (CNNs) remain the dominant approach due to their high performance, with recent focus on 1D kernels, such as in the Time-Frequency Separate Network (TF-SepNet), for reducing model complexity and computational cost. However, TF-SepNet performs feature extraction using a fixed receptive field in both time and frequency dimensions, which restricts its ability to capture multiscale contextual patterns. In this study, we investigate the integration of multiscale feature extraction modules into TF-SepNet, with the aim of improving model efficiency by balancing accuracy and complexity. We propose three architectures, TFSepDCD-Net, TFSepSPP-Net, and TFSepASPP-Net, each with two architectural variants based on TF-SepNet: one replaces its max pooling layers, and the other replaces its final convolutional layer. Each architecture has three configurations corresponding to different model sizes (small, medium, and large) to explore the trade-off between accuracy and model complexity. Our experiments show that incorporating multiscale modules allows smaller models to achieve comparable or even superior accuracy to larger baselines. These findings highlight the potential of multiscale representations for improving the efficiency of CNN-based ASC systems, especially in 1D separate architectures like TF-SepNet.
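The abstract contrasts a fixed receptive field with multiscale context but does not describe the modules' internals. As a rough illustration only, and not the authors' implementation, the sketch below applies one 1D kernel at several dilation rates in parallel, which is the general idea behind ASPP-style modules. The function names, the kernel values, and the dilation rates are all assumptions chosen for the example.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated 1D kernel (stride 1)."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D cross-correlation with a dilated kernel (no padding, stride 1)."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def multiscale_branches(signal, kernel, dilations):
    """Hypothetical helper: run the same kernel at several dilation rates in parallel.

    Each branch sees a different amount of context; a real module would
    typically pad the branches to equal length and fuse them.
    """
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    frames = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # toy 1D feature sequence
    edge_kernel = [1, 0, -1]               # illustrative kernel values
    for d, out in multiscale_branches(frames, edge_kernel, [1, 2, 4]).items():
        print(f"dilation={d} receptive_field={receptive_field(3, d)} output={out}")
```

With a kernel of size 3, the three branches span 3, 5, and 9 input samples while each uses the same three weights, which is why dilation is a common way to widen context without adding parameters.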
Keywords:
acoustic scene classification
time-frequency separate network
1D kernels
multiscale feature extraction
Peer Reviewed:
Yes
International Distribution:
Yes
Type:
Conference object
Activation date:
November 24, 2025
Permalink
https://fis.uni-bamberg.de/handle/uniba/111568