BattleSound: A Game Sound Benchmark for the Sound-Specific Feedback Generation in a Battle Game

A haptic sensor coupled to a gamepad or headset is frequently used to enhance the sense of immersion for game players. However, providing haptic feedback for appropriate sound effects requires specialized audio-engineering techniques to identify target sounds, which vary from game to game. We propose a deep learning-based method for sound event detection (SED) to determine the optimal timing of haptic feedback in extremely noisy environments. To accomplish this, we introduce the BattleSound dataset, which contains a large volume of in-game recordings of sound effects and other distracting sounds, including voice chat, from PlayerUnknown’s Battlegrounds (PUBG). Given the highly noisy and distracting nature of war-game environments, we set the annotation interval to 0.5 s, significantly shorter than in existing SED benchmarks, to increase the likelihood that each annotated label contains sound from a single source. As a baseline, we adopt mobile-sized deep learning models to perform two tasks: weapon sound event detection (WSED) and voice chat activity detection (VCAD). The accuracy of the models trained on BattleSound exceeds 90% for both tasks; thus, BattleSound enables real-time game sound recognition in noisy environments via deep learning. In addition, we demonstrate that performance degrades significantly when the annotation interval exceeds 0.5 s, indicating that BattleSound's short annotation interval is advantageous for SED applications that demand real-time inference.
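The 0.5 s annotation interval amounts to frame-level labeling: the recording is split into short non-overlapping windows, each of which receives its own class label (e.g., weapon sound, voice chat, or background). A minimal sketch of this windowing step is shown below; the function name and the 16 kHz sample rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def slice_into_windows(waveform: np.ndarray, sample_rate: int,
                       window_s: float = 0.5) -> np.ndarray:
    """Split a 1-D waveform into non-overlapping fixed-length windows.

    Each returned row corresponds to one annotation interval; the
    trailing remainder that does not fill a full window is dropped.
    """
    win = int(sample_rate * window_s)          # samples per window
    n = len(waveform) // win                   # number of full windows
    return waveform[: n * win].reshape(n, win)

# Example: 3.2 s of audio at 16 kHz yields six 0.5 s windows
# (the final 0.2 s remainder is discarded).
audio = np.zeros(int(3.2 * 16000), dtype=np.float32)
windows = slice_into_windows(audio, 16000)
print(windows.shape)  # (6, 8000)
```

With labels attached per row, each window can then be fed independently to the WSED or VCAD classifier, which is what makes the short interval compatible with real-time inference.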
