Automated generation of convolutional neural network training data using video sources

One of the challenges in applying techniques such as convolutional neural networks and deep learning to automated object recognition in images and video is generating sufficient quantities of labeled training image data in a cost-effective way. Hundreds of thousands of tagged frames are generally preferred for each category or label, and a human tagging images frame by frame could expect to spend hundreds of hours creating such a training set. One alternative is to use video as a source of training images. A human tagger notes the start and stop time of each appearance of an object of interest in a clip. The video is then broken into its component frames using software such as ffmpeg; frames that fall within the annotated time intervals are labeled as “targets,” and the remaining frames are labeled as “non-targets.” This separation into categories can be automated. The time required by a human viewer using this method is around ten hours, at least one to two orders of magnitude less than labeling frame by frame. The false alarm rate and target detection rate can be optimized by providing the system with unambiguous training examples.
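The interval-based labeling step described above can be illustrated with a short sketch. The code below is an assumption-laden example, not the authors' implementation: it calls the ffmpeg command-line tool to extract frames at a fixed rate, then assigns each frame a “target” or “non-target” label according to whether its approximate timestamp falls inside a human-annotated (start, stop) interval. The file names, frame rate, and example intervals are hypothetical.

```python
"""Minimal sketch of interval-based frame labeling.
Assumes ffmpeg is installed and on PATH; paths, FPS, and the
target_intervals values are illustrative placeholders."""
import subprocess
from pathlib import Path

FPS = 5  # assumed frame-extraction rate (frames per second)

def extract_frames(video_path: str, out_dir: str, fps: int = FPS) -> list[Path]:
    """Decompose the video into still frames with ffmpeg."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         str(out / "frame_%06d.jpg")],
        check=True,
    )
    return sorted(out.glob("frame_*.jpg"))

def label_frames(frames: list[Path],
                 target_intervals: list[tuple[float, float]],
                 fps: int = FPS) -> dict[Path, str]:
    """Label each frame 'target' if its timestamp falls inside any
    annotated (start, stop) interval, otherwise 'non-target'."""
    labels = {}
    for idx, frame in enumerate(frames):
        t = idx / fps  # approximate timestamp in seconds of this frame
        in_target = any(start <= t <= stop for start, stop in target_intervals)
        labels[frame] = "target" if in_target else "non-target"
    return labels

if __name__ == "__main__":
    # Example: annotator noted objects of interest at 12.0-45.5 s and 80.0-95.0 s.
    frames = extract_frames("clip.mp4", "frames/")
    labels = label_frames(frames, [(12.0, 45.5), (80.0, 95.0)])
```

In practice the labeled frame paths would then be written to a manifest (or copied into per-class directories) for ingestion by a CNN training pipeline; only the interval annotation itself requires human effort.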
