Motion and Appearance Based Multi-Task Learning Network for Autonomous Driving

Autonomous driving has various visual perception tasks such as object detection, motion detection, depth estimation and flow estimation. Multi-task learning (MTL) has been successfully used for jointly estimating some of these tasks. Previous work was focused on utilizing appearance cues. In this paper, we address the gap of incorporating motion cues in a multi-task learning system. We propose a novel two-stream architecture for joint learning of object detection, road segmentation and motion segmentation. We designed three different versions of our network to establish systematic comparison. We show that the joint training of tasks significantly improves accuracy compared to training them independently even with a relatively smaller amount of annotated samples for motion segmentation. To enable joint training, we extended KITTI object detection dataset to include moving/static annotations of the vehicles. An extension of this new dataset named KITTI MOD is made publicly available via the official KITTI benchmark website 1. Our baseline network outperforms MPNet which is a state of the art for single stream CNN-based motion detection. The proposed two-stream architecture improves the mAP score by 21.5% in KITTI MOD. We also evaluated our algorithm on the non-automotive DAVIS dataset and obtained accuracy close to the state-of-the-art performance. The proposed network runs at 8 fps on a Titan X GPU using a two-stream VGG16 encoder. Demonstration of the work is provided in 2.

[1]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Iasonas Kokkinos,et al.  UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[4]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[5]  Xin Zhang,et al.  End to End Learning for Self-Driving Cars , 2016, ArXiv.

[6]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Cordelia Schmid,et al.  SfM-Net: Learning of Structure and Motion from Video , 2017, ArXiv.

[10]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[12]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Richard Szeliski,et al.  Video Segmentation with Background Motion Models , 2017, BMVC.

[14]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Martin Lauer,et al.  3D Traffic Scene Understanding From Movable Platforms , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  P. Torr Geometric motion segmentation and model selection , 1998, Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences.

[18]  Jitendra Malik,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence Segmentation of Moving Objects by Long Term Video Analysis , 2022 .

[19]  Yang Gao,et al.  End-to-End Learning of Driving Models from Large-Scale Video Datasets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Roberto Cipolla,et al.  MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving , 2016, 2018 IEEE Intelligent Vehicles Symposium (IV).

[21]  Thomas Brox,et al.  Object Detection, Tracking, and Motion Segmentation for Object-level Video Segmentation , 2016, ArXiv.