Towards efficient and objective work sampling: Recognizing workers' activities in site surveillance videos with two-stream convolutional networks

Abstract: Capturing the working states of workers on foot allows managers to precisely quantify and benchmark labor productivity, which in turn enables them to evaluate productivity losses and identify their causes. Work sampling is a widely used method for this task, but it suffers from low efficiency because only one worker is selected for each observation. Attentional selection asymmetry can also bias the selection, undermining the method's assumption of uniform object selection. Existing vision-based methods are primarily oriented towards recognizing single, separated activities involving a few workers or pieces of equipment. In this paper, we introduce an activity recognition method that takes surveillance videos as input and produces diverse and continuous activity labels for the individual workers in the field of view. Convolutional networks are used to recognize activities, which are encoded in spatial and temporal streams. A new fusion strategy is developed to combine the recognition results of the two streams. The experimental results show that our activity recognition method achieves an average accuracy of 80.5%, which is comparable with the state of the art in activity recognition in the computer vision community, given the severe camera motion and low resolution of site surveillance videos and the marginal inter-class difference and significant intra-class variation of workers' activities. We also demonstrate that our method can underpin the implementation of efficient and objective work sampling. The training and test datasets of the study are publicly available.
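The abstract does not specify the new fusion strategy, so the following is only a minimal sketch of the general idea, assuming a plain weighted average of the two streams' softmax scores (a common late-fusion baseline, not the authors' method). The activity labels, the `temporal_weight` parameter, and all function names are hypothetical. The last helper illustrates how per-clip labels could be turned into the time shares that work sampling estimates.

```python
import math
from collections import Counter


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of spatial and temporal class probabilities.

    ASSUMPTION: simple weighted late fusion; the paper's own fusion
    strategy is not described in the abstract and may differ.
    """
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1 - w) * s + w * t for s, t in zip(spatial, temporal)]


def predict(spatial_logits, temporal_logits, labels, temporal_weight=0.5):
    """Return the label with the highest fused probability."""
    fused = fuse_streams(spatial_logits, temporal_logits, temporal_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


def activity_shares(per_clip_labels):
    """Fraction of observed clips spent in each activity."""
    counts = Counter(per_clip_labels)
    total = len(per_clip_labels)
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    labels = ["walking", "standing", "working"]  # hypothetical classes
    print(predict([2.0, 0.1, 0.1], [0.1, 0.2, 3.0], labels))
    print(activity_shares(["working", "working", "walking", "standing"]))
```

Because every tracked worker receives a label in every clip, the shares can be computed per worker over the whole video rather than from one sampled worker per observation, which is the efficiency gain the abstract refers to.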
