This paper presents Pythia, a deep learning research platform for vision & language tasks. Pythia is built with a plug-and-play strategy at its core, which enables researchers to quickly build, reproduce, and benchmark novel models for vision & language tasks such as Visual Question Answering (VQA), Visual Dialog, and Image Captioning. Built on top of PyTorch, Pythia features (i) high-level abstractions for operations commonly used in vision & language tasks, (ii) a modular and easily extensible framework for rapid prototyping, and (iii) a flexible trainer API that can handle multiple tasks seamlessly. Pythia is the first framework to support multi-tasking in the vision & language domain. Pythia also includes reference implementations of several recent state-of-the-art models for benchmarking, along with utilities such as smart configuration, multiple metrics, checkpointing, reporting, and logging. Our hope is that by providing a research platform focused on flexibility, reproducibility, and efficiency, we can help researchers push the state of the art for vision & language tasks.

Over the last few years, we have seen impressive progress in vision & language tasks like Visual Question Answering (VQA) and Image Captioning, powered by deep learning. Most state-of-the-art networks build upon the same techniques for generating text and image representations and for the network's layers. However, the devil lies in the details, and reproducing results from state-of-the-art models has often been non-trivial. This in turn hinders faster experimentation and progress in research. With Pythia¹, we hope to break down these design, implementation, and reproducibility barriers by providing a modular and flexible platform for research on vision & language (VQA and related) tasks ([10][6][14]). The platform enables easy reproducibility and fosters novel research by taking care of low-level details around IO, tasks, datasets, and model architectures, while providing the flexibility to easily try out new ideas. Pythia is built on top of the winning entries to the VQA Challenge 2018 and the VizWiz Challenge 2018. Pythia includes a set of reference implementations of current state-of-the-art models for easy comparison². We derive inspiration from software suites like AllenNLP [8], Detectron [9], and ParlAI [16], which aim to break similar barriers in other machine learning domains such as natural language processing and computer vision.

Framework Design: In Pythia (see Figure 1a), a central trainer loads a bootstrapper, which sets up the components required for training. The bootstrapper builds a model based on the network configuration provided by the researcher. For loading the data, the bootstrapper instantiates a task loader, which can load multiple tasks based on the configuration. Pythia works on a plugin-based registry in which tasks and models register themselves to a particular key in the registry mapping. Furthermore, datasets register themselves to one or more tasks. This registry enables dynamic loading of models and tasks at runtime based on the configuration (a minimal illustrative sketch of this pattern is given after the footnotes below). A task first builds, if not already present, and then loads the datasets registered to it. A dataset is responsible for its own metrics, logging, and loss function, thus keeping the trainer agnostic to the details of the data. See Figure 1b for a tree overview of tasks (second level) and the datasets registered to them.

¹ A preliminary version of Pythia (v0.2) is available at https://github.com/facebookresearch/pythia. Note that v0.3, which is described in this abstract, will be open-sourced soon.
² We plan to release pre-trained models for these implementations for easy comparisons in v0.3.
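To make the plugin-based registry and configuration-driven loading described above concrete, the following is a minimal sketch in Python. It is not Pythia's actual API: the names Registry, register_task, register_model, and build_from_config, as well as the keys "vqa" and "example_model", are assumptions introduced purely for illustration. The sketch only shows the general pattern by which tasks and models register themselves under string keys so that the trainer can resolve them at runtime from a configuration.

class Registry:
    """Maps string keys to task and model classes so they can be resolved at runtime."""

    def __init__(self):
        self._tasks = {}
        self._models = {}

    def register_task(self, key):
        def decorator(cls):
            self._tasks[key] = cls
            return cls
        return decorator

    def register_model(self, key):
        def decorator(cls):
            self._models[key] = cls
            return cls
        return decorator

    def get_task(self, key):
        return self._tasks[key]

    def get_model(self, key):
        return self._models[key]


registry = Registry()


@registry.register_task("vqa")  # hypothetical key, for illustration only
class VQATask:
    def __init__(self, config):
        # In the platform described above, the task would build (if needed) and
        # load the datasets registered to it; each dataset owns its metrics,
        # logging, and loss function.
        self.config = config


@registry.register_model("example_model")  # hypothetical key, for illustration only
class ExampleModel:
    def __init__(self, config):
        self.config = config


def build_from_config(config):
    """Resolve task and model classes from the string keys in the researcher's config."""
    task = registry.get_task(config["task"])(config)
    model = registry.get_model(config["model"])(config)
    return task, model


task, model = build_from_config({"task": "vqa", "model": "example_model"})

Because registration happens at import time through decorators, adding a new model or task amounts to writing a new class, registering it under a key, and referring to that key in the configuration; the trainer itself does not need to change.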
References

[1] José M. F. Moura, et al. Visual Dialog. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[2] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
[3] Xinlei Chen, et al. Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv, 2018.
[4] Richard Socher, et al. The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv, 2018.
[5] Byoung-Tak Zhang, et al. Bilinear Attention Networks. NeurIPS, 2018.
[6] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. BlackboxNLP@EMNLP, 2018.
[7] Luke S. Zettlemoyer, et al. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv, 2018.
[8] Christopher Joseph Pal, et al. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. ICLR, 2018.
[9] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations. NAACL, 2018.
[10] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training. 2018.
[11] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and VQA. arXiv, 2017.
[12] Jason Weston, et al. ParlAI: A Dialog Research Software Platform. EMNLP, 2017.
[13] Holger Schwenk, et al. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. EMNLP, 2017.
[14] Li Fei-Fei, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 2016.
[16] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2014.
[17] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation. EMNLP, 2014.
[18] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks. NIPS, 2014.
[19] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP, 2014.
[20] Pietro Perona, et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.
[21] Jeffrey P. Bigham, et al. VizWiz: nearly real-time answers to visual questions. W4A, 2010.