ÜberNet: a ‘Universal’ CNN for the Joint Treatment of Low-, Mid-, and High-Level Vision Problems

Computer vision involves a host of tasks, such as boundary detection, semantic segmentation, surface estimation, object detection, and image classification, to name a few. While these problems are typically tackled by different Deep Convolutional Neural Networks (DCNNs), a joint treatment of them can result not only in simpler, faster, and better systems, but can also be a catalyst for reaching out to other fields. Such all-in-one architectures will become imperative when considering general AI, involving for instance robots that will be able to recognize objects, navigate towards them, and manipulate them. Furthermore, having a single visual module to address a multitude of tasks will make it possible to explore methods that improve performance on all of them, rather than developing techniques that only apply to limited problems. In this work we develop a single broad-and-deep architecture to jointly address low-, mid-, and high-level tasks of computer vision. This ‘universal’ network is intended to act like a ‘Swiss knife’ for vision tasks, and we call it an ÜberNet to indicate its overarching nature. Our current architecture has been systematically evaluated on the following tasks: (i) boundary detection, (ii) normal estimation, (iii) saliency estimation, (iv) semantic segmentation, and (v) proposal generation and object detection. We are currently in the process of integrating the tasks of (vi) symmetry detection, (vii) depth estimation, (viii) semantic part segmentation, (ix) semantic boundary detection, and (x) viewpoint estimation, establishing a full vision decathlon. Our present system operates in 0.3–0.4 seconds per frame on a GPU and delivers excellent results across the first five tasks. In normal estimation and saliency detection in particular we obtain results that directly compete with the most recently published techniques of [3] and [7] respectively.
In object detection we improve the performance of a strong Faster R-CNN baseline [10] from 78.7 mean Average Precision (mAP) on the PASCAL VOC 2007 test set to 80.3 by combining detection with segmentation, while the all-in-one network we finally propose recovers the original score of 78.7. Our starting point is our recent work on using deep learning for boundary detection [5], where we showed that a multi-resolution DCNN can yield substantial improvements on that task. Moving on to more tasks, we observed that with minor architectural changes the same network can be used to solve quite distinct low/mid-level vision tasks, such as normal estimation or saliency detection, with performance that directly competes with or outperforms the current state-of-the-art on each of those tasks. We attribute this success to the use of (i) side layers and deep supervision, as proposed in [11], (ii) a multi-scale architecture that is trained end-to-end, as proposed in [5], and (iii) batch normalization at the side layers, which we adopted after [5]. A complementary line of progress has been the fusion of segmentation and detection; for this we use a ‘two-headed’ network, where one head is fully convolutional until the end and addresses the semantic segmentation task, as in [1, 5, 9], while the other relies on region proposals to process a shortlist of interesting positions, as in [10]. Our first experiments in this direction demonstrated that the two tasks can benefit from each other when solved jointly, in two ways: firstly, simply training a network that performs two tasks turned out to improve performance, and secondly, by interleaving the segmentation and detection tasks we have been able to provide information from one task to the other. Our current results indicate that through this procedure we can improve detection performance from 78.7 to 80.3 mAP on the PASCAL VOC 2007 test set, as shown in Table 1.
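The deep-supervision scheme of ingredient (i) above can be sketched as follows. This is a hedged, framework-free illustration, not the paper's implementation: the function names, the squared-error surrogate loss, and the `side_weight` parameter are our own illustrative choices.

```python
def sq_err(pred, target):
    """Toy surrogate loss: mean squared error over a flat list of values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def deeply_supervised_loss(side_outputs, fused_output, target, side_weight=0.5):
    """HED-style deep supervision: every intermediate 'side' output is
    penalised against the same target as the final fused output, so gradients
    reach the early layers directly rather than only through the fusion layer."""
    loss = sq_err(fused_output, target)        # loss on the fused prediction
    for side in side_outputs:                  # one extra loss per side layer
        loss += side_weight * sq_err(side, target)
    return loss
```

In a real network each side output would be a per-pixel prediction taken from an intermediate convolutional layer, and batch normalization at those side layers (ingredient iii) keeps their activation statistics stable during joint training.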
Even though it was relatively straightforward to train the low- and high-level task networks in isolation, it turned out to be much more challenging to jointly tackle all of the problems above. At first sight this seems similar to simply training a network with a few more losses for additional tasks, as recently done e.g. in [2, 3, 4]. However, we had to face several technical challenges, the most important of which has been the absence of datasets with annotations for all of the tasks considered. High-level annotation is

Figure 1: Dense labelling results for four tasks (boundary detection, normal estimation, saliency detection, semantic segmentation) obtained by a single multi-resolution ÜberNet; detection results are not visualized but evaluated below.
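Since no single dataset annotates every task, one natural remedy is to activate each task's loss only when the corresponding annotation exists, so that a training image contributes gradients only for its labelled tasks. The sketch below illustrates this kind of loss masking; the dictionary-based interface and the squared-error surrogate loss are our own illustrative choices, not the paper's code.

```python
def task_loss(pred, label):
    """Toy per-task surrogate loss (mean squared error over flat lists)."""
    return sum((p - l) ** 2 for p, l in zip(pred, label)) / len(label)

def multitask_loss(predictions, labels, weights):
    """Sum weighted per-task losses, skipping tasks whose annotation is
    missing for this image (label is None); skipped tasks get no gradient."""
    total = 0.0
    for task, pred in predictions.items():
        label = labels.get(task)
        if label is None:          # unannotated task: contributes nothing
            continue
        total += weights.get(task, 1.0) * task_loss(pred, label)
    return total
```

With this scheme, heterogeneous datasets (e.g. one annotated for boundaries, another for saliency) can be mixed in a single training stream, each image supervising only its own subset of tasks.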

[1] Marc Pollefeys et al. Discriminatively Trained Dense Surface Normal Estimation. ECCV, 2014.

[2] Iasonas Kokkinos et al. Pushing the Boundaries of Boundary Detection using Deep Learning. ICLR, 2016.

[3] James M. Rehg et al. The Secrets of Salient Object Segmentation. CVPR, 2014.

[4] Iasonas Kokkinos et al. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR, 2015.

[5] Yizhou Yu et al. Deep Contrast Learning for Salient Object Detection. CVPR, 2016.

[6] Rob Fergus et al. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. ICCV, 2015.

[7] Trevor Darrell et al. Fully Convolutional Networks for Semantic Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[8] Jian Sun et al. Instance-Aware Semantic Segmentation via Multi-task Network Cascades. CVPR, 2016.

[9] Kaiming He et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[10] Jitendra Malik et al. Contextual Action Recognition with R*CNN. ICCV, 2015.