Spatial Attentional Bilinear 3D Convolutional Network for Video-Based Autism Spectrum Disorder Detection

Video-based Autism Spectrum Disorder (ASD) detection is challenging for most video classification networks because the categories are highly similar to one another. Bilinear pooling is a second-order method that is widely used in fine-grained visual recognition. However, the average summation in bilinear pooling limits its ability to perceive spatial information, which is detrimental to fine-grained visual recognition. In this paper, we propose spatial attentional bilinear pooling, which enhances spatial information extraction without significantly increasing the number of parameters. Further, we propose SA-B3D, a fine-grained action recognition network combined with an LSTM model, for video-based ASD detection. The proposed model can focus on more discriminative regions dynamically and effectively. Compared with state-of-the-art models, the proposed model achieves a significant improvement on a video-based ASD dataset.
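To make the contrast between average summation and spatial weighting concrete, the following is a minimal NumPy sketch, not the SA-B3D formulation itself, whose details the abstract does not give. It assumes a feature map flattened to `N` spatial locations by `C` channels and a vector of per-location attention scores; the function names, the softmax normalization, and the toy sizes are all assumptions chosen for illustration. In a real network the scores would be produced by a learned layer rather than supplied by hand.

```python
import numpy as np


def bilinear_pool(feats):
    """Plain bilinear pooling: average the outer products x_i x_i^T
    over all N spatial locations. feats has shape (N, C)."""
    n = feats.shape[0]
    return feats.T @ feats / n


def attentional_bilinear_pool(feats, scores):
    """Hypothetical spatially weighted variant: each location's outer
    product is weighted by a softmax over per-location scores (shape (N,)),
    so the pooled descriptor is no longer a uniform average."""
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights = weights / weights.sum()
    return (feats * weights[:, None]).T @ feats


# Toy input: 6 spatial locations, 4 channels (sizes are arbitrary).
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))

# Equal scores give uniform weights, which recovers plain bilinear pooling.
uniform_scores = np.zeros(6)
plain = bilinear_pool(feats)
uniform = attentional_bilinear_pool(feats, uniform_scores)

# A strongly peaked score makes the descriptor depend on one location only.
peaked_scores = np.full(6, -50.0)
peaked_scores[2] = 50.0
peaked = attentional_bilinear_pool(feats, peaked_scores)
```

With equal scores the weighted version reduces to the plain average, which is the sense in which average summation discards spatial information: every location contributes equally regardless of how discriminative it is. Unequal scores let the pooled second-order descriptor emphasize particular regions while adding only the parameters needed to produce the scores.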
