论文信息 - Learning Surrogates via Deep Embedding

Learning Surrogates via Deep Embedding

This paper proposes a technique for training a neural network by minimizing a surrogate loss that approximates the target evaluation metric, which may be non-differentiable. The surrogate is learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned using the learned surrogate. Without a significant computational overhead and any bells and whistles, improvements are demonstrated on challenging and practical tasks of scene-text recognition and detection. In the recognition task, the model is tuned using a surrogate approximating the edit distance metric and achieves up to $39\%$ relative improvement in the total edit distance. In the detection task, the surrogate approximates the intersection over union metric for rotated bounding boxes and yields up to $4.25\%$ relative improvement in the $F_{1}$ score.

[1] Seong Joon Oh,et al. What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[2] Andrew Zisserman,et al. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition , 2014, ArXiv.

[3] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[4] Palaiahnakote Shivakumara,et al. A robust arbitrary text detection system for natural scene images , 2014, Expert Syst. Appl..

[5] Peter Reinartz,et al. Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery , 2018, ACCV.

[6] Xiaoou Tang,et al. Image Super-Resolution Using Deep Convolutional Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7] P. Pérez,et al. SoDeep: A Sorting Deep Net to Learn Ranking Loss Surrogates , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Tamir Hazan,et al. Direct Loss Minimization for Structured Prediction , 2010, NIPS.

[9] Yang Song,et al. Training Deep Neural Networks via Direct Loss Minimization , 2015, ICML.

[10] Yang Wang,et al. Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation , 2016, ISVC.

[11] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12] Tianqi Chen,et al. Empirical Evaluation of Rectified Activations in Convolutional Network , 2015, ArXiv.

[13] Ali Farhadi,et al. You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Aaron C. Courville,et al. Improved Training of Wasserstein GANs , 2017, NIPS.

[15] Zhenyu He,et al. The Seventh Visual Object Tracking VOT2019 Challenge Results , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[16] R. Manmatha,et al. Deep Perceptual Compression , 2019, ArXiv.

[17] Jiri Matas,et al. E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text , 2018, ACCV Workshops.

[18] Andrew Zisserman,et al. Spatial Transformer Networks , 2015, NIPS.

[19] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[21] Jon Almazán,et al. ICDAR 2013 Robust Reading Competition , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[22] Luca Benini,et al. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations , 2017, NIPS.

[23] Gattigorla Nagendar,et al. Neuro-IoU: Learning a Surrogate Loss for Semantic Segmentation , 2018, BMVC.

[24] Ernest Valveny,et al. ICDAR 2015 competition on Robust Reading , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[25] Bernt Schiele,et al. Loss Functions for Top-k Error: Analysis and Insights , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Xiangyang Xue,et al. Arbitrary-Oriented Scene Text Detection via Rotation Proposals , 2017, IEEE Transactions on Multimedia.

[27] Jitendra Malik,et al. Learning to Optimize Neural Nets , 2017, ArXiv.

[28] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[29] Tatsuya Harada,et al. Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30] Frank Hutter,et al. Neural Architecture Search: A Survey , 2018, J. Mach. Learn. Res..

[31] Tara N. Sainath,et al. Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32] Andrew Zisserman,et al. Smooth Loss Functions for Deep Top-k Classification , 2018, ICLR.

[33] Nikos Komodakis,et al. Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[34] Frank Hutter,et al. Neural Architecture Search , 2019, Automated Machine Learning.

[35] C. V. Jawahar,et al. Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[36] Junjie Yan,et al. FOTS: Fast Oriented Text Spotting with a Unified Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37] Jiri Matas,et al. ICDAR2017 Robust Reading Challenge on COCO-Text , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[38] Jiebo Luo,et al. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[40] Michael S. Ryoo,et al. AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures , 2019, ICLR.

[41] Quoc V. Le,et al. Neural Architecture Search with Reinforcement Learning , 2016, ICLR.

[42] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.

[43] Jooyoung Lee,et al. Context-adaptive Entropy Model for End-to-end Optimized Image Compression , 2018, ICLR.

[44] C. V. Jawahar,et al. Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Wei Liu,et al. STAR-Net: A SpaTial Attention Residue Network for Scene Text Recognition , 2016, BMVC.

[46] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[47] Kai Wang,et al. End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[48] Simon M. Lucas,et al. ICDAR 2003 robust reading competitions , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[49] Palaiahnakote Shivakumara,et al. Recognizing Text with Perspective Distortion in Natural Scenes , 2013, 2013 IEEE International Conference on Computer Vision.

[50] Yoshua Bengio,et al. Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[51] Eric Brachmann,et al. BOP: Benchmark for 6D Object Pose Estimation , 2018, ECCV.

[52] Yash Patel,et al. Hierarchical Auto-Regressive Model for Image Compression Incorporating Object Saliency and a Deep Perceptual Loss , 2020, ArXiv.

[53] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54] David Minnen,et al. Variational image compression with a scale hyperprior , 2018, ICLR.

[55] Xiang Zhang,et al. Character-level Convolutional Networks for Text Classification , 2015, NIPS.

[56] Wafa Khlif,et al. ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition — RRC-MLT-2019 , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[57] Matthijs Douze,et al. Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[58] Ankush Gupta,et al. Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Christian Ledig,et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[61] Yuning Jiang,et al. UnitBox: An Advanced Object Detection Network , 2016, ACM Multimedia.

[62] Silvio Savarese,et al. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Lars Schmidt-Thieme,et al. Learning Surrogate Losses , 2019, ArXiv.

[64] Xiang Bai,et al. Robust Scene Text Recognition with Automatic Rectification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65] Zhou Wang,et al. Multiscale structural similarity for image quality assessment , 2003, The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003.