Training Neural Networks to Produce Compatible Features

In computer vision we often train a different neural network for each task, while reuse of existing knowledge typically remains limited to ImageNet pre-training. Human knowledge, by contrast, is composable and reusable (e.g. [16]), so it seems prudent to give neural networks these properties too. For example, adding a new class should only require training parts of the network, as is done in incremental learning [6] and few-shot learning [14, 18]. Furthermore, when a network can recognize cars in daylight, this knowledge should help it recognize cars at night through domain adaptation [15]. We believe that a general way to achieve network reusability is to have a large set of compatible network components which are specialized for different tasks: some would extract features from RGB images, depth images, or optical flow fields; other components could use these features to classify animals, detect cars, or segment roads. The compatibility of the components makes it easy to mix and match them for the task at hand. Besides few-shot learning and domain adaptation, this would also enable training a single classifier which can be deployed on various devices, each with its own hardware-specific backbone network.

To explore whether it is feasible to obtain such a set of compatible components, we ask: is it possible to train network components so that they are compatible directly after training? This question is related to works which investigate whether different networks trained on the same data learn similar representations, even when they are trained independently [10, 11, 12, 17, 9, 7]. If they do, components of different networks can be recombined by learning a simple mapping from the feature representation space of one network to that of the other [9]. If this mapping is a simple permutation, the networks learn exactly the same features, only in a different order.
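The idea of recovering such a mapping post hoc can be illustrated with a toy numpy sketch (all names, shapes, and the synthetic data here are hypothetical, for illustration only): if two networks learned the same features up to ordering, an ordinary least-squares fit between their feature spaces recovers the permutation.

```python
import numpy as np

# Hypothetical setup: pretend two networks learned the same 4-D features,
# but network b outputs them in a permuted order.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(100, 4))     # features from network a
perm = np.eye(4)[[2, 0, 3, 1]]          # an arbitrary permutation matrix
feats_b = feats_a @ perm                # network b: same features, reordered

# Fit a linear map M with feats_a @ M ≈ feats_b (ordinary least squares).
M, *_ = np.linalg.lstsq(feats_a, feats_b, rcond=None)

# If the networks learned identical features up to ordering, the recovered
# map is (numerically) the permutation itself.
print(np.allclose(M, perm, atol=1e-6))  # True
```

In practice the fit is not exact, and one measures how close M is to a simple transformation rather than testing exact equality.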
In practice, low-level features tend to be learned repeatedly across networks [9], while high-level layers learn feature spaces which cannot be mapped onto each other through a simple transformation [11, 17, 7, 9]. Instead of such a post-hoc analysis, we want to directly optimize networks to learn compatible features, without the need to determine any mapping afterwards.

Figure 1: Experimental setup with auxiliary task head. We train two networks a and b. We regularize training by adding an auxiliary task head, which may discriminate common classes or predict rotation. This forces both feature extractors to be compatible with the auxiliary task head, effectively aligning their features. This makes feature extractor b compatible with classification head a, and vice versa.
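The joint objective behind this setup can be sketched as follows. This is a minimal numpy sketch with toy linear components and made-up shapes, not the actual implementation: two feature extractors each feed their own task head, and both additionally feed one shared auxiliary head, whose loss acts as the alignment regularizer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy components: linear feature extractors for networks a and b,
# their own classification heads, and ONE auxiliary head shared by both.
W_a, W_b = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))        # extractors
head_a, head_b = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))  # task heads
head_aux = rng.normal(size=(4, 2))                                 # shared aux head

def softmax_xent(logits, labels):
    """Mean cross-entropy of softmax(logits) against integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

x = rng.normal(size=(16, 8))            # one batch of inputs
y = rng.integers(0, 3, size=16)         # main-task labels
y_aux = rng.integers(0, 2, size=16)     # auxiliary labels (e.g. rotation)

f_a, f_b = x @ W_a, x @ W_b             # features from both extractors

# Joint objective: each network solves its own task, and BOTH must satisfy
# the same auxiliary head -- the term that pushes their features to align.
loss = (softmax_xent(f_a @ head_a, y) + softmax_xent(f_b @ head_b, y)
        + softmax_xent(f_a @ head_aux, y_aux)
        + softmax_xent(f_b @ head_aux, y_aux))
```

After training under such an objective, compatibility can be tested by mixing components, e.g. evaluating the cross combination `f_b @ head_a` (feature extractor b with classification head a).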

[1] Po-Hsuan Chen, et al. Shared Representational Geometry Across Neural Networks, 2018, ArXiv.

[2] Nikos Komodakis, et al. Unsupervised Representation Learning by Predicting Image Rotations, 2018, ICLR.

[3] Liwei Wang, et al. Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation, 2018, NeurIPS.

[4] Hod Lipson, et al. Convergent Learning: Do different neural networks learn the same representations?, 2015, FE@NIPS.

[5] Alexei A. Efros, et al. Unsupervised Domain Adaptation through Self-Supervision, 2019, ArXiv.

[6] Alexander Kolesnikov, et al. S4L: Self-Supervised Semi-Supervised Learning, 2019, IEEE/CVF International Conference on Computer Vision (ICCV).

[7] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016.

[8] Samy Bengio, et al. Are All Layers Created Equal?, 2019, J. Mach. Learn. Res.

[9] Geoffrey E. Hinton, et al. Similarity of Neural Network Representations Revisited, 2019, ICML.

[10] Subhransu Maji, et al. Boosting Supervision with Self-Supervision for Few-shot Learning, 2019, ArXiv.

[11] Charles Kemp, et al. How to Grow a Mind: Statistics, Structure, and Abstraction, 2011, Science.

[12] Jitendra Malik, et al. Cross Modal Distillation for Supervision Transfer, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016.

[13] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.

[14] Paolo Favaro, et al. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, 2016, ECCV.

[15] Razvan Pascanu, et al. Overcoming catastrophic forgetting in neural networks, 2016, Proceedings of the National Academy of Sciences.

[16] Ali Razavi, et al. Data-Efficient Image Recognition with Contrastive Predictive Coding, 2019, ICML.

[17] Andrea Vedaldi, et al. Understanding Image Representations by Measuring Their Equivariance and Equivalence, 2014, International Journal of Computer Vision.

[18] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[19] Samy Bengio, et al. Insights on representational similarity in neural networks with canonical correlation, 2018, NeurIPS.