CLIPTER: Looking at the Bigger Picture in Scene Text Recognition

Reading text in real-world scenarios often requires understanding the context surrounding it, especially when the text is of poor quality. However, current scene text recognizers are unaware of the bigger picture because they operate on cropped text images. In this study, we harness the representational capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer's word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.
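
The gated fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single attention head, the scalar tanh gate initialized at zero, the feature dimensions, and the names `GatedCrossAttention`, `word_feats`, and `scene_feats` are all assumptions made for illustration. The point it shows is that a closed gate leaves the pretrained recognizer's features untouched, while opening the gate gradually mixes in scene-level context.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class GatedCrossAttention:
    """Single-head cross-attention with a scalar tanh gate (hypothetical layout)."""

    def __init__(self, d_word, d_scene, d_attn=32, seed=0):
        rng = np.random.default_rng(seed)
        # Projections: queries from word-level features, keys/values from scene features.
        self.w_q = rng.normal(0.0, 0.02, (d_word, d_attn))
        self.w_k = rng.normal(0.0, 0.02, (d_scene, d_attn))
        self.w_v = rng.normal(0.0, 0.02, (d_scene, d_word))
        # Assumed gate parameter: zero at initialization, so the context path starts closed.
        self.alpha = 0.0

    def __call__(self, word_feats, scene_feats):
        q = word_feats @ self.w_q                        # (T, d_attn)
        k = scene_feats @ self.w_k                       # (P, d_attn)
        v = scene_feats @ self.w_v                       # (P, d_word)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, P), rows sum to 1
        context = attn @ v                               # (T, d_word)
        # Residual connection scaled by the gate: tanh(0) = 0 gives the identity.
        return word_feats + np.tanh(self.alpha) * context


# Toy inputs: 26 word-level positions and 50 scene tokens (sizes are arbitrary).
rng = np.random.default_rng(1)
word_feats = rng.normal(size=(26, 64))
scene_feats = rng.normal(size=(50, 128))

fusion = GatedCrossAttention(d_word=64, d_scene=128)
out_closed = fusion(word_feats, scene_feats)   # gate closed: output equals input

fusion.alpha = 0.5                             # stands in for a gate value learned during fine-tuning
out_open = fusion(word_feats, scene_feats)     # gate partly open: context is mixed in
```

In a trainable version, `alpha` would be a learned parameter. Starting it at zero means fine-tuning begins from the unmodified recognizer, which is one plausible reading of the stable fine-tuning the abstract refers to.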
