Towards Automatic Model Specialization for Edge Video Analytics

Judging by popular, generic computer vision challenges such as ILSVRC (ImageNet Large Scale Visual Recognition Challenge) or PASCAL VOC, neural networks have proven exceptionally accurate in some tasks, even surpassing human-level accuracy. However, state-of-the-art accuracy often comes at a high computational price, requiring high-end hardware acceleration to achieve anything near real-time performance. At the same time, use cases such as smart cities or autonomous vehicles require automated, real-time analysis of images from fixed cameras. Because these streams would generate vast and constant network traffic, we cannot rely on offloading compute to the omnipresent and omnipotent cloud. Consequently, a distributed edge cloud should take over and process images locally. However, the edge cloud is resource-constrained by nature, which limits the computational complexity of the models that can be executed at the edge. Nonetheless, accurate real-time video analytics and the edge cloud need a meeting point. One solution is to specialize lightweight models on a per-camera basis, but this quickly becomes infeasible as the number of cameras grows unless the process is fully automated. In this paper, we present and evaluate COVA (Contextually Optimized Video Analytics), a framework to assist in the automatic specialization of models for video analytics in edge cloud cameras. COVA aims to improve the accuracy of lightweight models by specializing them automatically, without human intervention. Moreover, we discuss and analyze each step involved in the process to understand the trade-offs that each one entails.
Additionally, we show how the sole assumption of static cameras allows us to make a series of considerations that greatly narrow the scope of the problem and, in turn, enable COVA to successfully specialize models using traditional computer vision techniques specifically chosen for the task. Through COVA, we show that complex neural networks, i.e., those able to generalize well, can be effectively used as teachers to annotate datasets for the specialization of lightweight neural networks, adapting them to the specific context in which they will be deployed. This allows us to tailor models to increase their accuracy while keeping their computational cost constant, and to do so without any human interaction. Results show that COVA can automatically improve pre-trained models by an average of 21% on the different scenes of the VIRAT dataset.
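
As a rough illustration of the idea described above, the following sketch shows how a static-camera assumption could be exploited to build a specialization dataset: a simple background model flags frames that contain change, and only those frames are passed to a teacher model for annotation. This is not COVA's implementation. The running-average background model, the thresholds, and the names `build_training_set` and `teacher` are all assumptions made for the example, since the abstract does not specify which techniques are used.

```python
from typing import Callable, Iterable, List, Tuple

Frame = List[List[float]]        # grayscale image as rows of pixel intensities
Annotation = Tuple[str, float]   # hypothetical (label, confidence) pair


def changed_fraction(frame: Frame, background: Frame, pixel_thresh: float) -> float:
    """Fraction of pixels differing from the background by more than pixel_thresh."""
    total = sum(len(row) for row in frame)
    changed = sum(
        1
        for row, bg_row in zip(frame, background)
        for p, b in zip(row, bg_row)
        if abs(p - b) > pixel_thresh
    )
    return changed / total if total else 0.0


def update_background(background: Frame, frame: Frame, alpha: float) -> Frame:
    """Exponential running average: slowly absorb the new frame into the background."""
    return [
        [(1 - alpha) * b + alpha * p for p, b in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


def build_training_set(
    frames: Iterable[Frame],
    teacher: Callable[[Frame], List[Annotation]],
    pixel_thresh: float = 25.0,   # assumed per-pixel intensity threshold
    motion_thresh: float = 0.01,  # assumed minimum fraction of changed pixels
    alpha: float = 0.05,          # assumed background learning rate
) -> List[Tuple[Frame, List[Annotation]]]:
    """Keep only frames with motion and let the teacher model annotate them."""
    it = iter(frames)
    try:
        first = next(it)
    except StopIteration:
        return []
    # The first frame seeds the background, which is valid only for a static camera.
    background = [[float(p) for p in row] for row in first]
    dataset = []
    for frame in it:
        if changed_fraction(frame, background, pixel_thresh) >= motion_thresh:
            labels = teacher(frame)  # expensive model, run only on frames with change
            if labels:
                dataset.append((frame, labels))
        background = update_background(background, frame, alpha)
    return dataset
```

The resulting `(frame, labels)` pairs would then serve as training data for fine-tuning a lightweight student model on that camera's scene. Skipping static frames keeps the number of teacher invocations low, which matters because the teacher is assumed to be far more expensive than the student.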
