STRATIFIED TIME-FREQUENCY FEATURES FOR CNN-BASED ACOUSTIC SCENE CLASSIFICATION
Technical Report

An acoustic scene signal is a mixture of diverse sound events that frequently overlap with one another. CNN models for acoustic scene classification often suffer from overfitting because they may memorize the overlapped sounds as the representative patterns of a scene, and may then fail to recognize the scene when only one of those sounds is present. Starting from a standard CNN setup with a log-Mel feature as input, we propose to stratify the log-Mel image into several component images according to sound duration, so that each component image contains a specific type of time-frequency pattern. We then emphasize independent modeling of these time-frequency patterns to better exploit the stratified features. Experimental results on the TAU Urban Acoustic Scenes 2019 development dataset [1] show that the stratified features significantly improve classification performance.
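
The report does not spell out the stratification step here; as a rough illustration only, the sketch below splits a log-Mel image into duration-based components using time-axis median filters of increasing length. The function name, filter lengths, and the median-filter choice itself are assumptions for this sketch, not necessarily the authors' method.

    # Minimal sketch of duration-based stratification of a log-Mel image.
    # Assumption: median filtering along the time axis at several window
    # lengths, so each component keeps sounds of one duration range.
    import numpy as np
    import librosa
    import scipy.ndimage

    def stratified_log_mel(y, sr, n_mels=128, durations=(5, 51)):
        """Split a log-Mel image into component images by sound duration.

        `durations` (in frames) are hypothetical median-filter lengths:
        short windows pass transient events, long windows pass stationary
        background. The components sum back to the original log-Mel image.
        """
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)

        components = []
        residual = log_mel
        # Smooth with increasingly long time-axis median filters; the
        # difference between successive smoothings isolates events of
        # intermediate duration.
        for win in sorted(durations):
            smoothed = scipy.ndimage.median_filter(log_mel, size=(1, win))
            components.append(residual - smoothed)  # shorter-duration content
            residual = smoothed
        components.append(residual)  # longest-duration (near-stationary) content
        return np.stack(components)  # shape: (n_components, n_mels, n_frames)

    # Usage: feed the stacked components to a CNN, e.g. as input channels.
    # y, sr = librosa.load("scene.wav", sr=None)
    # x = stratified_log_mel(y, sr)  # e.g. (3, 128, T)

Stacking the components as separate input channels is one simple way to let the network model each time-frequency pattern type independently, in the spirit of the report.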