DNN-based Environmental Sound Recognition with Real-recorded and Artificially-mixed Training Data

In this paper, we report on our investigation of environmental sound recognition using a deep neural network (DNN). Preparing a sufficient amount of training data is generally important in machine learning. However, different environmental sounds, such as cicada and ambulance sounds, occur overlapping each other, so training data that include mixtures of different sounds are necessary for environmental sound recognition. It is difficult, though, to obtain all combinations of different sounds in real-recorded data. In this study, we increased the amount of training data using artificially-mixed sounds. First, some distinctive single sounds, each recorded individually on a different day near its sound source, were selected; others were separated from the real-recorded data by extracting appropriate segments in the time domain. Filters were then applied to the sounds that mainly consisted of a single sound source to attenuate the other sounds in the frequency domain. Next, the single sounds were mixed at different ratios of sound levels, simulating the variation of possible mixtures in the real environment. Finally, both the real-recorded data and the artificially-mixed sound data were used to train the DNN, and we then conducted environmental sound recognition experiments. We show that this approach achieves more accurate results than using only real-recorded data.
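The mixing step described above can be sketched in code. The following is a minimal illustration, not the paper's actual pipeline: it mixes two single-source recordings at a requested level ratio in dB, where the function name `mix_at_ratio`, the parameter `ratio_db`, and the RMS-based gain computation are our own assumptions about one plausible way to realize "different ratios of sound levels".

```python
import numpy as np

def mix_at_ratio(primary, secondary, ratio_db):
    """Mix two single-source signals at a target level ratio.

    `ratio_db` is the desired level of `primary` relative to
    `secondary` in dB (hypothetical parameter; the paper does not
    specify its exact mixing procedure).
    """
    # Truncate to a common length so the signals can be summed.
    n = min(len(primary), len(secondary))
    p, s = primary[:n], secondary[:n]
    # Measure the RMS level of each source.
    rms_p = np.sqrt(np.mean(p ** 2))
    rms_s = np.sqrt(np.mean(s ** 2))
    # Scale the secondary source so the mix has the requested ratio.
    gain = (rms_p / rms_s) * 10 ** (-ratio_db / 20)
    mix = p + gain * s
    # Normalize only if the mix would clip at full scale.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix
```

Sweeping `ratio_db` over a range (e.g. -10 dB to +10 dB) for each pair of single sounds would then produce the variation of mixtures used to augment the training set.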