Enhancing Multiscale Features for Efficient Acoustic Scene Classification with One-Dimensional Separate CNN

Authors: Yuxuan He; Alexander Raake; Jakob Abeßer (ORCID: 0000-0003-4689-7944)
Date: 2025-11-24
Year: 2025
ISBN: 978-84-09-77652-8
Handle: https://fis.uni-bamberg.de/handle/uniba/111568
DOI: 10.5281/zenodo.17251589
Language: eng
Type: conferenceobject
Keywords: acoustic scene classification; time-frequency separate network; 1D kernels; multiscale feature extraction

Abstract: Acoustic Scene Classification (ASC) is a fundamental task in audio signal processing that aims to identify the location in which an environmental audio recording was made. Recent advances focus on improving ASC model efficiency, particularly in resource-constrained environments. Convolutional neural networks (CNNs) remain the dominant approach due to their high performance, and current work increasingly relies on 1D kernels, as in the Time-Frequency Separate Network (TF-SepNet), to reduce model complexity and computational cost. However, TF-SepNet performs feature extraction with a fixed receptive field in both the time and frequency dimensions, which restricts its ability to capture multiscale contextual patterns. In this study, we investigate the integration of multiscale feature extraction modules into TF-SepNet, with the aim of improving model efficiency by balancing accuracy and complexity. We propose three architectures, TFSepDCD-Net, TFSepSPP-Net, and TFSepASPP-Net, each with two architectural variants based on TF-SepNet: one replaces its max pooling layers, and the other replaces its final convolutional layer. Each architecture has three configurations corresponding to different model sizes (small, medium, and large) to explore the trade-off between accuracy and model complexity. Our experiments show that incorporating multiscale modules allows smaller models to achieve accuracy comparable or even superior to that of larger baselines. These findings highlight the potential of multiscale representations for improving the efficiency of CNN-based ASC systems, especially in 1D separate architectures like TF-SepNet.
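
The contrast between a fixed receptive field and multiscale feature extraction can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`dilated_conv1d`, `receptive_field`, `multiscale_branches`), the shared three-tap kernel, and the dilation rates (1, 2, 4) are assumptions chosen for illustration only. The sketch applies one 1D kernel along a single axis at several dilation rates in parallel, which is the general idea behind atrous (dilated) multiscale modules; the record above does not specify how the proposed modules are actually built.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded, same-length 1D cross-correlation with a dilated kernel.

    Assumes an odd kernel length so the padding is symmetric.
    """
    span = (len(kernel) - 1) * dilation   # distance between first and last tap
    pad = span // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - pad + j * dilation  # input position seen by tap j
            if 0 <= idx < len(signal):    # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by a single dilated 1D kernel."""
    return (kernel_size - 1) * dilation + 1


def multiscale_branches(signal, kernel, dilations=(1, 2, 4)):
    """Apply the same kernel at several dilation rates, one output per branch."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]   # single event in the middle
    kernel = [1, 1, 1]                       # hypothetical three-tap kernel
    for d, branch in zip((1, 2, 4), multiscale_branches(impulse, kernel)):
        print(f"dilation={d} receptive_field={receptive_field(3, d)} output={branch}")
```

With a single dilation rate, every output position sees the same fixed span of the input. Running several rates side by side lets the same three-tap kernel respond to context at spans of 3, 5, and 9 positions without adding weights, which is one way a multiscale module can trade a small amount of computation for wider context.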