General Facial Representation Learning in a Visual-Linguistic Manner

How to learn a universal facial representation that boosts all face analysis tasks? This paper takes one step toward this goal. In this paper, we study the transfer performance of pre-trained models on face analysis tasks and introduce a framework, called FaRL, for general Facial Representation Learning in a visual-linguistic manner. On one hand, the framework involves a contrastive loss to learn high-level semantic meaning from image-text pairs. On the other hand, we propose exploring low-level information simultaneously to further enhance the face representation, by adding a masked image modeling. We perform pre-training on LAION-FACE, a dataset containing large amount of face image-text pairs, and evaluate the representation capability on multiple downstream tasks. We show that FaRL achieves better transfer performance compared with previous pretrained models. We also verify its superiority in the lowdata regime. More importantly, our model surpasses the state-of-the-art methods on face analysis tasks including face parsing and face alignment.

[1]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[2]  Ming Zeng,et al.  Face Parsing With RoI Tanh-Warping , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Josef Kittler,et al.  Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Omer Levy,et al.  BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[5]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Cheng Li,et al.  Face alignment by coarse-to-fine shape searching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Lucas Beyer,et al.  Big Transfer (BiT): General Visual Representation Learning , 2020, ECCV.

[8]  Julien Mairal,et al.  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.

[9]  Omer Levy,et al.  SpanBERT: Improving Pre-training by Representing and Predicting Spans , 2019, TACL.

[10]  Yi Yang,et al.  Teacher Supervises Students How to Learn From Partially Labeled Images for Facial Landmark Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Saining Xie,et al.  An Empirical Study of Training Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[13]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Marek Kowalski,et al.  Deep Alignment Network: A Convolutional Neural Network for Robust Face Alignment , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[16]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[17]  Fuxin Li,et al.  Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Yu Cheng,et al.  UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.

[19]  Cheng Li,et al.  Unconstrained Face Alignment via Cascaded Compositional Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Stefanos Zafeiriou,et al.  300 Faces in-the-Wild Challenge: The First Facial Landmark Localization Challenge , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[21]  Shuicheng Yan,et al.  Landmark Free Face Attribute Prediction , 2018, IEEE Transactions on Image Processing.

[22]  Kang Li,et al.  Towards Efficient U-Nets: A Coupled and Quantized Approach , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Alec Radford,et al.  Improving Language Understanding by Generative Pre-Training , 2018 .

[24]  Debing Zhang,et al.  Lightweight Face Recognition Challenge , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[25]  Jiaya Jia,et al.  Aggregation via Separation: Boosting Facial Landmark Detector With Semi-Supervised Style Translation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Xiaodong Liu,et al.  Unified Language Model Pre-training for Natural Language Understanding and Generation , 2019, NeurIPS.

[27]  Yuning Jiang,et al.  Unified Perceptual Parsing for Scene Understanding , 2018, ECCV.

[28]  Jianlong Fu,et al.  Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[30]  Jianfeng Gao,et al.  Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.

[31]  Abhishek Das,et al.  Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[33]  Xiaogang Wang,et al.  Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Furu Wei,et al.  BEiT: BERT Pre-Training of Image Transformers , 2021, ArXiv.

[35]  Matthias Niessner,et al.  Generalized Zero and Few-Shot Transfer for Facial Forgery Detection , 2020, ArXiv.

[36]  Ning Zhang,et al.  Laplace Landmark Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[38]  Stefanos Zafeiriou,et al.  A Semi-automatic Methodology for Facial Landmark Annotation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[39]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[40]  Matthieu Cord,et al.  DeCaFA: Deep Convolutional Cascade for Face Alignment in the Wild , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Ling Luo,et al.  EHANet: An Effective Hierarchical Aggregation Network for Face Parsing , 2020, Applied Sciences.

[42]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[43]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[44]  Yi Yang,et al.  Style Aggregated Network for Facial Landmark Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[46]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Michal Valko,et al.  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[48]  Marwan Mattar,et al.  Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments , 2008 .

[49]  Nan Duan,et al.  Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.

[50]  Matthijs Douze,et al.  Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[51]  José Miguel Buenaposada,et al.  A Deeply-Initialized Coarse-to-fine Ensemble of Regression Trees for Face Alignment , 2018, ECCV.

[52]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[53]  Qiang Ji,et al.  Face Alignment With Kernel Density Deep Neural Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[55]  Zhongfei Zhang,et al.  Partially Shared Multi-task Convolutional Neural Network with Local Constraint for Face Attribute Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[56]  Yao Sun,et al.  Accurate Facial Image Parsing at Real-Time Speed , 2019, IEEE Transactions on Image Processing.

[57]  Kaiming He,et al.  Exploring the Limits of Weakly Supervised Pretraining , 2018, ECCV.

[58]  Jongyoo Kim,et al.  ADNet: Leveraging Error-Bias Towards Normal Direction in Face Alignment , 2021, ArXiv.

[59]  Jiahuan Zhou,et al.  Learning Robust Facial Landmark Detection via Hierarchical Structured Ensemble , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[60]  Hassan Foroosh,et al.  Slim-CNN: A Light-Weight CNN for Face Attribute Prediction , 2019, 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020).

[61]  Geoffrey E. Hinton,et al.  Big Self-Supervised Models are Strong Semi-Supervised Learners , 2020, NeurIPS.

[62]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63]  Chunhua Shen,et al.  Learning Spatial-Semantic Relationship for Facial Attribute Recognition with Limited Labeled Data , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Dong Liu,et al.  High-Resolution Representations for Labeling Pixels and Regions , 2019, ArXiv.

[65]  Olatunji Ruwase,et al.  ZeRO: Memory optimizations Toward Training Trillion Parameter Models , 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[66]  Thomas Brox,et al.  Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.

[67]  Irene Kotsia,et al.  RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Cheng Yang,et al.  Few-Shot Knowledge Transfer for Fine-Grained Cartoon Face Generation , 2020, 2021 IEEE International Conference on Multimedia and Expo (ICME).

[69]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[70]  Heng Huang,et al.  Direct Shape Regression Networks for End-to-End Face Alignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[71]  Stefanos Zafeiriou,et al.  300 Faces In-The-Wild Challenge: database and results , 2016, Image Vis. Comput..

[72]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[73]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[74]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[75]  David Berthelot,et al.  FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence , 2020, NeurIPS.

[76]  Kimmo Kärkkäinen,et al.  FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age , 2019, ArXiv.

[77]  Feiyue Huang,et al.  Harnessing Synthesized Abstraction Images to Improve Facial Attribute Recognition , 2018, IJCAI.

[78]  Junnan Li,et al.  Align before Fuse: Vision and Language Representation Learning with Momentum Distillation , 2021, NeurIPS.

[79]  William J. Christmas,et al.  Dynamic Attention-Controlled Cascaded Shape Regression Exploiting Training Data Augmentation and Fuzzy-Set Sample Weighting , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Yejin Choi,et al.  VinVL: Revisiting Visual Representations in Vision-Language Models , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[81]  R Devon Hjelm,et al.  Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.

[82]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[83]  Yici Cai,et al.  Look at Boundary: A Boundary-Aware Face Alignment Algorithm , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[84]  Furu Wei,et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[85]  Stefanos Zafeiriou,et al.  RetinaFace: Single-stage Dense Face Localisation in the Wild , 2019, ArXiv.

[86]  Ye Wang,et al.  LUVLi Face Alignment: Estimating Landmarks’ Location, Uncertainty, and Visibility Likelihood , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[87]  Hailin Shi,et al.  Edge-aware Graph Representation Learning and Reasoning for Face Parsing , 2020, ECCV.

[88]  Ilya Sutskever,et al.  Zero-Shot Text-to-Image Generation , 2021, ICML.

[89]  Kaiming He,et al.  Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[90]  Hailin Shi,et al.  A New Dataset and Boundary-Attention Semantic Segmentation for Face Parsing , 2020, AAAI.

[91]  Mark Chen,et al.  Generative Pretraining From Pixels , 2020, ICML.

[92]  Yan Yan,et al.  Deep Multi-Task Multi-Label CNN for Effective Facial Attribute Classification , 2020, IEEE Transactions on Affective Computing.

[93]  Kimin Lee,et al.  Using Pre-Training Can Improve Model Robustness and Uncertainty , 2019, ICML.

[94]  Wenyan Wu,et al.  Leveraging Intra and Inter-Dataset Variations for Robust Face Alignment , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[95]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[96]  Jenia Jitsev,et al.  LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs , 2021, ArXiv.

[97]  Mohit Bansal,et al.  LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.

[98]  Christian Wallraven,et al.  3FabRec: Fast Few-Shot Face Alignment by Reconstruction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[99]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[100]  Lingyun Wu,et al.  MaskGAN: Towards Diverse and Interactive Facial Image Manipulation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[101]  Shin Ishii,et al.  Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[102]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[103]  Yuxiao Hu,et al.  MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition , 2016, ECCV.

[104]  Fernando De la Torre,et al.  Supervised Descent Method and Its Applications to Face Alignment , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[105]  Georgios Tzimiropoulos,et al.  Pre-training strategies and datasets for facial representation learning , 2021, ArXiv.

[106]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[107]  Quoc V. Le,et al.  Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.

[108]  Hao Tian,et al.  ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph , 2020, AAAI.

[109]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[110]  Weihong Deng,et al.  Face Transformer for Recognition , 2021, ArXiv.

[111]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[112]  Tao Mei,et al.  AGRNet: Adaptive Graph Representation Learning and Reasoning for Face Parsing , 2021, IEEE Transactions on Image Processing.

[113]  Jian Sun,et al.  Face Alignment by Explicit Shape Regression , 2012, International Journal of Computer Vision.