Object tracking in video with TensorFlow

This thesis [13] was born as a collaboration between the BSC Computer Science Department [5] and the UPC Image Processing Group [23], with the purpose of developing a hybrid thesis on Deep Learning. Interest in Machine Learning is nowadays among the fastest growing in the field. On the side of the BSC Computer Science Department [5], which mainly uses its computational power for data mining and modelling analysis, the main purpose was to assess how difficult it would be to adapt its "asterix" infrastructure, from the GPU Center of Excellence at BSC/UPC [4], to Deep Learning. On the side of the UPC IPG, the interest was to test that environment by developing a model for object tracking in video suitable for the ILSVRC VID challenge [43].

To achieve the first goal and analyse the workload on the machine, I became an active user of the TensorFlow [21] community, learning from posts and blogs, and I set up a virtual environment that let us use different dependencies and different versions of the library, depending on the model and the purpose at hand. From the computer science point of view, this environment has so far been the best choice and the most useful setup to work with, thanks to its ease of use and installation; the only problems came from third-party libraries specific to the visual recognition area, such as OpenCV.

To develop the model for the VID challenge, I first learned the basics of Deep Learning through online courses such as Stanford's, and then moved on to the deeper and more complex material on visual recognition, reading papers and studying the main strategies and models that would be useful to me later, during development. The discovery of so many new concepts gave me enthusiasm and scared me at the same time: theory and practice were complementary, but passing from the first to the second was not easy. That step was the most difficult part of the project, because it was not enough to adapt my previous knowledge and programming skills to the new concepts and, above all, to the TensorFlow environment.

Due to its recent birth, the Python library has not yet accumulated as many models and components as other environments such as Caffe [3] or Theano [50], but community interest is growing so fast that, luckily, I did not have to start from scratch. I used some models made available directly by Google [1] and some GitHub projects, such as TensorBox [44] by a Stanford PhD student [45], and tested others, such as the TensorFlow version of YOLO [16]. The main components of my work came from some of the GitHub projects I found, but none of them was problem-free. Given the time constraints, I first tried to extend the OverFeat implementation of the TensorBox project [44] from single-class to multi-class detection, spending a lot of effort and time trying to improve the quality of the model, to which I also made some significant contributions by solving some of its main code pitfalls. In the end, the large reverse-engineering work I carried out with the help of the author did not give the expected results. I therefore had to change the overall architecture, using other strategies and introducing partially redundant components, to adapt what is in essence a still-image detection model to video: I introduced a temporal and spatial analysis to correlate results between frames and make the detection and tracking of the objects more consistent, as sketched below.
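As an illustration of this frame-to-frame correlation step, the following is a minimal sketch, not the thesis code: it assumes detections arrive as per-frame lists of (x1, y1, x2, y2) boxes, greedily matches each box to the closest box of the previous frame by intersection-over-union, and blends matched coordinates with an exponential moving average. The function names and thresholds are illustrative assumptions.

    # Illustrative sketch only: greedy IoU matching between consecutive
    # frames plus exponential smoothing of matched boxes. Thresholds and
    # names are assumptions, not the thesis implementation.

    def iou(a, b):
        # Intersection-over-union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def smooth_tracks(frames, iou_thr=0.5, alpha=0.7):
        # frames: list of per-frame detection lists, each box (x1, y1, x2, y2).
        prev, out = [], []
        for dets in frames:
            smoothed = []
            for box in dets:
                match = max(prev, key=lambda p: iou(p, box), default=None)
                if match is not None and iou(match, box) >= iou_thr:
                    # Blend with the matched box from the previous frame.
                    box = tuple(alpha * m + (1 - alpha) * b
                                for m, b in zip(match, box))
                smoothed.append(box)
            out.append(smoothed)
            prev = smoothed
        return out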
Starting from the modular architecture proposed by K. Kang et al. [32], I decided to use the single-class OverFeat implementation as a general object detector, training it on the whole class set, and to follow it with further components. After the General Detector, I implemented a Tracker & Smoother to keep the detections consistent in shape and motion over time and space across the whole frame set, using the slow and steady feature analysis explained by Dinesh Jayaraman and Kristen Grauman [31]. The final component, Inception, is the most redundant module, because the same ideas are at the base of the OverFeat architecture; nevertheless, its use was the fastest and, in practice, the only solution to label each object easily. Thanks to the available model [20] implemented by Google [1], which was trained on the ILSVRC classification task, I only had to retrain it on a much smaller class set, thirty classes instead of one thousand, a workload sustainable by any personal computer on the market. The complete architecture is composed of three main components, connected in the following order: General Detector, Tracker & Smoother, Inception (see the sketches at the end of this section).

In the end I obtained a working environment and model, which allowed me to submit results for the challenge, to evaluate the workload for the "asterix" infrastructure of the GPU Center of Excellence at BSC/UPC, and to show how a department can build a working Deep Learning research and development area in a few months. The methodology I used could be called fast learning and fast developing. Since I had to start everything from scratch, I first had to acquire the necessary basic theory, then the more complex and specific knowledge, and finally implement both in the shortest possible time, which was not easy at all. These reasons, together with the time constraints, pushed me to learn and develop as fast as possible, relying on tricks, tips and available components, and saving time for error solving. The latter is a substantial part of my work: solving run-time and project problems sometimes took hours, sometimes entire days, and once caused a system crash of the "asterix" infrastructure, followed by ten days of blackout during August because of the summer period.

The model reached the last position in the ILSVRC VID competition because of its low mAP results. As I will explain, the source of these results is the first component, which is not followed by modules able to boost its accuracy; at the same time I will highlight how a different order of the components and a better implementation of them, for example a trainable Tracker plus Smoother, could be the starting improvements for this first draft of the work. Moreover, further precautions could be taken on the dataset used to train the single components, boosting their accuracy; I trained only on the provided training set, without applying any tricks or tips to it. The thesis goals were fully reached in the best way I could, solving problems and walking through a path full of pitfalls, which made the project harder to complete.
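To make the three-stage composition described above concrete, here is a minimal sketch of how the stages could be chained. All callables (detect, smooth, classify) are hypothetical placeholders standing in for the General Detector, the Tracker & Smoother, and the retrained Inception model; they are not the actual thesis components.

    # Hypothetical glue code for the three-stage pipeline; the three
    # callables are placeholders, not the actual thesis components.

    def run_pipeline(frames, detect, smooth, classify):
        # frames: iterable of images; detect(img) -> list of boxes;
        # smooth(per_frame_boxes) -> temporally linked boxes;
        # classify(img, box) -> (label, score).
        per_frame = [detect(img) for img in frames]    # General Detector
        per_frame = smooth(per_frame)                  # Tracker & Smoother
        return [[(box, classify(img, box)) for box in boxes]   # Inception
                for img, boxes in zip(frames, per_frame)]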
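The retraining step for Inception follows the usual transfer-learning recipe: freeze the convolutional layers of the pretrained network and train only a new softmax layer for the thirty VID classes. The following TensorFlow 1.x-style sketch is a minimal illustration in the spirit of Google's retraining example, with assumed shapes and names; it trains a classifier on precomputed "bottleneck" features rather than on raw images.

    import tensorflow as tf

    NUM_CLASSES = 30        # VID class set, instead of the 1000 ILSVRC classes
    BOTTLENECK_SIZE = 2048  # assumed size of Inception's penultimate layer

    # Bottleneck features are extracted once from the frozen Inception
    # graph; only this small softmax layer is trained.
    bottlenecks = tf.placeholder(tf.float32, [None, BOTTLENECK_SIZE])
    labels = tf.placeholder(tf.int64, [None])

    weights = tf.Variable(
        tf.truncated_normal([BOTTLENECK_SIZE, NUM_CLASSES], stddev=0.001))
    biases = tf.Variable(tf.zeros([NUM_CLASSES]))
    logits = tf.matmul(bottlenecks, weights) + biases

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)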

[1] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2] Xiang Zhang, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, 2013, ICLR.

[3] Ali Farhadi, et al. You Only Look Once: Unified, Real-Time Object Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[6] Andrew Y. Ng, et al. End-to-End People Detection in Crowded Scenes, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[8] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Communications of the ACM.

[10] Paolo Giorgini, et al. Università degli Studi di Trento, 2002.

[11] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] George A. Miller, et al. WordNet: A Lexical Database for English, 1995, HLT.

[13] Shuicheng Yan, et al. Seq-NMS for Video Object Detection, 2016, arXiv.

[14] Xiaogang Wang, et al. T-CNN: Tubelets With Convolutional Neural Networks for Object Detection From Videos, 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[15] Kristen Grauman, et al. Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Dumitru Erhan, et al. Going deeper with convolutions, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015 IEEE International Conference on Computer Vision (ICCV).

[18] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[19] Ross B. Girshick, et al. Fast R-CNN, 2015, arXiv:1504.08083.