Lookup NU author(s): Dr Xing Kek, Professor Cheng Chin
This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Abstract: The key to a low-complexity convolutional neural network model lies in controlling the number of network parameters and ensuring that the input representation is not excessively large. To address low-complexity acoustic scene classification (ASC), this paper proposes an enhanced wavelet scattering representation combined with mobile network modules and shuffling modules. Wavelet scattering is built from wavelet transforms at multiple wavelet scales, but the averaging operation that makes it invariant to translation limits the maximum timescale, so wavelet scattering is constrained by the Heisenberg uncertainty principle. Creating an input representation with multiple timescales, however, does not fit the brief of low-complexity modelling. We therefore propose a simple mixing of the different timescales of the first-order and second-order coefficients, yielding an input representation with nearly the same dimensions as standard wavelet scattering but with enhanced multiscale content. To further leverage this 'interleaved' wavelet scattering, this paper presents sub-spectral shuffling, inspired by shuffling modules that use stochasticity to improve a model's generalization. Unlike channel shuffle, which shuffles channel-wise, and spatial shuffle, which shuffles pixel-wise, sub-spectral shuffle shuffles the feature maps frequency-wise using the concept of binning. Each bin is shuffled so that the high-frequency spectrum can move to a low-frequency position, allowing the model to learn the general acoustic profile of a scene rather than memorizing what happens in the low- or high-frequency spectrum, which is erratic for ASC. This paper also studies temporal shuffling, which shuffles the feature maps time-wise, and evaluates sub-spectral shuffling, temporal shuffling, and channel shuffling individually. Our results demonstrate the superiority of sub-spectral shuffling and the modularity of the shuffling modules. We then evaluate various combinations of the three shuffling modules on three acoustic scene classification datasets. Our best model combines all three shuffling modules and achieves 70.6% classification accuracy on the DCASE 2021 Task 1a dataset, 82.15% on the ESC-50 dataset, and 81% on UrbanSound8K, with ~65K parameters and a model size of 126.6 KB. In addition, the inclusion of shuffling modules is shown to increase model performance, and sub-spectral shuffling is especially useful in improving log loss, a metric used to gauge the model's confidence.
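The sub-spectral shuffling idea described in the abstract can be sketched as follows. This is a minimal illustration only, assuming a (batch, channels, frequency, time) feature map; the module name `SubSpectralShuffle` and the `num_bins` parameter are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of frequency-wise (sub-spectral) shuffling: split the frequency
# axis of a feature map into bins and randomly permute the bins during training,
# so high-frequency content may land in low-frequency positions.
import torch
import torch.nn as nn


class SubSpectralShuffle(nn.Module):
    def __init__(self, num_bins: int = 4):
        super().__init__()
        self.num_bins = num_bins

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frequency, time); only shuffle while training.
        if not self.training:
            return x
        b, c, f, t = x.shape
        assert f % self.num_bins == 0, "frequency dim must divide evenly into bins"
        bin_size = f // self.num_bins
        # Reshape frequency axis into bins: (batch, channels, bins, bin_size, time)
        x = x.view(b, c, self.num_bins, bin_size, t)
        # Random permutation of the bins (shared across the batch in this sketch).
        perm = torch.randperm(self.num_bins, device=x.device)
        x = x[:, :, perm, :, :]
        return x.reshape(b, c, f, t)


if __name__ == "__main__":
    layer = SubSpectralShuffle(num_bins=4)
    layer.train()
    feats = torch.randn(2, 16, 64, 128)  # e.g. 64 frequency bins, 128 time frames
    print(layer(feats).shape)  # torch.Size([2, 16, 64, 128])
```

A temporal-shuffling variant would permute bins along the time axis instead, and channel shuffle would permute along the channel axis; the binning-and-permute pattern is the same in each case.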
Author(s): Kek XY, Chin CS, Li Y
Publication type: Article
Publication status: Published
Journal: IEEE Access
Year: 2022
Volume: 10
Pages: 82185-82201
Online publication date: 04/08/2022
Acceptance date: 30/07/2022
Date deposited: 24/08/2022
ISSN (electronic): 2169-3536
Publisher: Institute of Electrical and Electronics Engineers Inc.
URL: https://doi.org/10.1109/ACCESS.2022.3196338
DOI: 10.1109/ACCESS.2022.3196338