Early Fusion of Dense Optical Flow with Image for Semantic Segmentation in Autonomous Driving

Precise understanding of the scene around the car is essential for autonomous driving. Convolutional Neural Networks (CNNs) have been used widely and successfully for road scene understanding in the last few years. However, most of these networks have complex architectures that in turn require a complex system to deploy in the car. Typical systems today take input from cameras placed around the car, and CNNs process these images to provide an understanding of the environment. Hardware manufacturers now include accelerators in their Systems on Chip (SoCs) for certain computer vision tasks, such as Optical Flow (OF) and Stereo Vision (SV), which achieve good accuracy and fast runtime. Using these accelerators in tandem with the CNN to improve perception accuracy would be highly beneficial. In this paper, we explore using the dense optical flow output of the hardware accelerator as an additional input, alongside the image, so that the CNN can perceive the scene better and faster. We show that fusing optical flow with the image improves the mean Intersection over Union (IoU) of segmentation by over 1%, and improves the accuracy of major classes such as road, person, rider, motorcycle and bicycle by 2%, 1%, 5%, 7% and 11% respectively.
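As a minimal illustration of what early fusion can mean in practice, the sketch below stacks a two-channel flow field onto a three-channel image so that a network's first layer would see five channels per pixel. The abstract does not specify how the flow is encoded or fused, so everything here is an assumption: the raw (dx, dy) encoding, the hypothetical function name `early_fuse`, and the use of plain nested lists (a real pipeline would use an array library).

```python
def early_fuse(image, flow):
    """Concatenate image and flow channels per pixel.

    image: H x W grid of (r, g, b) tuples.
    flow:  H x W grid of (dx, dy) displacement tuples (assumed encoding).
    Returns an H x W grid of 5-tuples (r, g, b, dx, dy).
    """
    # Early fusion only makes sense if both inputs are pixel-aligned.
    if len(image) != len(flow):
        raise ValueError("image and flow must have the same height")
    fused = []
    for img_row, flow_row in zip(image, flow):
        if len(img_row) != len(flow_row):
            raise ValueError("image and flow must have the same width")
        # Channel-wise concatenation: 3 image channels + 2 flow channels.
        fused.append([tuple(px) + tuple(fl) for px, fl in zip(img_row, flow_row)])
    return fused


# Usage: a 1 x 2 image with its per-pixel flow vectors.
image = [[(10, 20, 30), (40, 50, 60)]]
flow = [[(0.5, -1.0), (0.0, 2.0)]]
fused = early_fuse(image, flow)
```

Under this reading, the segmentation network itself needs no structural change beyond a first convolution that accepts five input channels instead of three.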
