Team DEEP-HRI Moments in Time Challenge 2018 Technical Report

Video-based action recognition is challenging because it requires joint spatial and temporal reasoning. We propose a novel multi-view convolutional architecture that performs 2D convolution along three orthogonal views of the volumetric video data. With weight sharing across views, it encodes the spatio-temporal features of video clips efficiently and achieves superior performance over state-of-the-art spatio-temporal feature learning architectures. Furthermore, we also explore the auditory modality, which is complementary to the visual cues. Our final submission to the Moments in Time Challenge 2018 is an ensemble of several visual (RGB) and audio models, achieving a top-1 accuracy of 38.7% and a top-5 accuracy of 66.9% on the validation set.
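To make the multi-view idea concrete, below is a minimal PyTorch sketch of a block that applies a single shared 2D convolution along the three orthogonal views (H-W, T-W, T-H) of a clip tensor and fuses the responses. This is an illustrative assumption, not the team's released implementation: the (N, C, T, H, W) layout, the MultiViewConv2d name, the averaging fusion, and the kernel size are all choices made here for the sketch.

import torch
import torch.nn as nn


class MultiViewConv2d(nn.Module):
    """Shared 2D convolution applied along the H-W, T-W and T-H views of a video clip."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # One Conv2d whose weights are reused for all three views (weight sharing).
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding)

    def forward(self, x):                      # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape

        # View 1: convolve over (H, W), folding T into the batch dimension.
        hw = self.conv(x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w))
        hw = hw.view(n, t, -1, h, w).permute(0, 2, 1, 3, 4)          # (N, C', T, H, W)

        # View 2: convolve over (T, W), folding H into the batch dimension.
        tw = self.conv(x.permute(0, 3, 1, 2, 4).reshape(n * h, c, t, w))
        tw = tw.view(n, h, -1, t, w).permute(0, 2, 3, 1, 4)          # (N, C', T, H, W)

        # View 3: convolve over (T, H), folding W into the batch dimension.
        th = self.conv(x.permute(0, 4, 1, 2, 3).reshape(n * w, c, t, h))
        th = th.view(n, w, -1, t, h).permute(0, 2, 3, 4, 1)          # (N, C', T, H, W)

        # Fuse the three view responses (simple averaging, assumed here).
        return (hw + tw + th) / 3.0


# Usage: a batch of two 8-frame 112x112 RGB clips.
clip = torch.randn(2, 3, 8, 112, 112)
out = MultiViewConv2d(3, 64)(clip)
print(out.shape)  # torch.Size([2, 64, 8, 112, 112])

Because the same 2D kernel is reused for every view, the block adds no parameters over a plain 2D convolution while still mixing information along the temporal axis.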
