Challenging Images For Minds and Machines

There is no denying the tremendous leap in the performance of machine learning methods in the past half-decade. Some might even say that specific sub-fields in pattern recognition, such as machine-vision, are as good as solved, reaching human and super-human levels. Arguably, lack of training data and computation power are all that stand between us and solving the remaining ones. In this position paper we underline cases in vision which are challenging to machines and even to human observers. This is to show limitations of contemporary models that are hard to ameliorate by following the current trend to increase training data, network capacity or computational power. Moreover, we claim that attempting to do so is in principle a suboptimal approach. We provide a taster of such examples in hope to encourage and challenge the machine learning community to develop new directions to solve the said difficulties.

[1]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[2]  Antonio Torralba,et al.  Transfer Learning by Borrowing Examples for Multiclass Object Detection , 2011, NIPS.

[3]  Richard S. Zemel,et al.  Prototypical Networks for Few-shot Learning , 2017, NIPS.

[4]  Xiaoou Tang,et al.  Surpassing Human-Level Face Verification Performance on LFW with GaussianFace , 2014, AAAI.

[5]  Jitendra Malik,et al.  Recognizing objects in adversarial clutter: breaking a visual CAPTCHA , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[6]  Frank D. Wood,et al.  Using synthetic data to train neural networks is model-based reasoning , 2017, 2017 International Joint Conference on Neural Networks (IJCNN).

[7]  Shimon Ullman,et al.  From simple innate biases to complex visual concepts , 2012, Proceedings of the National Academy of Sciences.

[8]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Ohad Shamir,et al.  Failures of Deep Learning , 2017, ArXiv.

[10]  John K. Tsotsos,et al.  Cognitive programs: software for attention's executive , 2014, Front. Psychol..

[11]  John Langford,et al.  CAPTCHA: Using Hard AI Problems for Security , 2003, EUROCRYPT.

[12]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[13]  Shuchang Zhou,et al.  EAST: An Efficient and Accurate Scene Text Detector , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[15]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[16]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[18]  Dragomir Anguelov,et al.  Capturing Long-Tail Distributions of Object Subcategories , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Charless C. Fowlkes,et al.  Do We Need More Training Data? , 2015, International Journal of Computer Vision.

[20]  Xiangyang Luo,et al.  A Survey on Breaking Technique of Text-Based CAPTCHA , 2017, Secur. Commun. Networks.

[21]  Charless C. Fowlkes,et al.  Do We Need More Training Data or Better Models for Object Detection? , 2012, BMVC.

[22]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[24]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Ved Prakash Singh,et al.  Survey of Different Types of CAPTCHA , 2014 .

[26]  Martial Hebert,et al.  Learning to Model the Tail , 2017, NIPS.

[27]  Yang Yang,et al.  Deep Learning Scaling is Predictable, Empirically , 2017, ArXiv.

[28]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[29]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[30]  Lei Zhang,et al.  Face Recognition via Centralized Coordinate Learning , 2018, ArXiv.

[31]  Ohad Shamir,et al.  Failures of Gradient-Based Deep Learning , 2017, ICML.

[32]  Aaron C. Courville,et al.  Learning Visual Reasoning Without Strong Priors , 2017, ICML 2017.

[33]  Jiri Matas,et al.  COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images , 2016, ArXiv.

[34]  S. Ullman Visual routines , 1984, Cognition.

[35]  Demis Hassabis,et al.  Mastering the game of Go without human knowledge , 2017, Nature.