Pedestrian motion recognition via Conv-VLAD integrated spatial-temporal-relational network

Pedestrian motion recognition is an important component of intelligent transportation systems. Since commonly used spatial-temporal features are still insufficient for mining deep information from video frames, this study proposes a three-stream neural network called the spatial-temporal-relational network (STRN), in which static spatial information, dynamic motion, and differences between adjacent keyframes are jointly considered as features of the video records. In addition, an optimised pooling layer, the convolutional vector of locally aggregated descriptors (Conv-VLAD) layer, is employed before the final classification step in each stream to better aggregate the extracted features and reduce intra-class differences. To this end, the original video records are processed into RGB images, optical flow images and RGB difference images, each delivering the corresponding information to its stream. After a classification result is obtained from each stream, a decision-level fusion mechanism combines the partial understandings to improve the network's overall accuracy. Experimental results on two public datasets, UCF101 (94.7%) and HMDB51 (69.0%), show that the proposed method achieves significantly improved performance. The results of STRN have far-reaching significance for the application of deep learning in intelligent transportation systems to ensure pedestrian safety.
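To make the pipeline concrete, the sketch below illustrates two of the steps the abstract describes: deriving RGB difference images from adjacent keyframes for the relational stream, and decision-level fusion of per-stream softmax scores by weighted averaging. This is a minimal illustration only; the function names, the uniform default weights, and the use of a simple weighted average are assumptions, not the paper's exact formulation.

```python
import numpy as np

def rgb_difference(frames):
    """Difference images between adjacent keyframes (relational-stream input).

    frames: array of shape (T, H, W, 3); returns (T-1, H, W, 3).
    """
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]

def decision_level_fusion(stream_scores, weights=None):
    """Fuse per-stream class-probability vectors by weighted averaging.

    stream_scores: list of softmax outputs, each of shape (num_classes,).
    weights: optional per-stream weights; uniform by default (an assumption).
    """
    scores = np.stack(stream_scores)            # (num_streams, num_classes)
    if weights is None:
        weights = np.full(len(stream_scores), 1.0 / len(stream_scores))
    weights = np.asarray(weights, dtype=np.float32)
    fused = (weights[:, None] * scores).sum(axis=0)
    return fused / fused.sum()                  # renormalise to a distribution

# Hypothetical per-stream softmax outputs over 4 action classes.
spatial    = np.array([0.70, 0.10, 0.10, 0.10])
temporal   = np.array([0.40, 0.40, 0.10, 0.10])
relational = np.array([0.50, 0.20, 0.20, 0.10])
fused = decision_level_fusion([spatial, temporal, relational])
predicted_class = int(np.argmax(fused))
```

In this toy example the spatial and relational streams agree on class 0, so the fused score also favours class 0 even though the temporal stream is ambivalent, which is the kind of disagreement-resolving behaviour decision-level fusion is meant to provide.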
