This paper presents Pythia, a deep learning research platform for vision & language tasks. Pythia is built with a plug-and-play strategy at its core, which enables researchers to quickly build, reproduce, and benchmark novel models for vision & language tasks such as Visual Question Answering (VQA), Visual Dialog, and Image Captioning. Built on top of PyTorch, Pythia features (i) high-level abstractions for operations commonly used in vision & language tasks, (ii) a modular and easily extensible framework for rapid prototyping, and (iii) a flexible trainer API that can handle multiple tasks seamlessly. Pythia is the first framework to support multi-tasking in the vision & language domain. Pythia also includes reference implementations of several recent state-of-the-art models for benchmarking, along with utilities such as smart configuration, multiple metrics, checkpointing, reporting, and logging. Our hope is that by providing a research platform focused on flexibility, reproducibility, and efficiency, we can help researchers push the state of the art for vision & language tasks.

Over the last few years, we have seen impressive progress in vision & language tasks like Visual Question Answering (VQA) and Image Captioning, powered by deep learning. Most state-of-the-art networks build upon the same techniques for generating text and image representations and for the network's layers. However, the devil lies in the details, and reproducing results from state-of-the-art models has often been non-trivial. This in turn hinders faster experimentation and progress in research. With Pythia¹, we hope to break down these design, implementation, and reproducibility barriers by providing a modular and flexible platform for research on vision & language (VQA and related) tasks ([10][6][14]). The platform enables easy reproducibility and fosters novel research by taking care of low-level details around IO, tasks, datasets, and model architectures, while providing the flexibility to easily try out new ideas. Pythia is built on top of the winning entries to the VQA Challenge 2018 and the VizWiz Challenge 2018. Pythia includes a set of reference implementations of current state-of-the-art models for easy comparison². We derive inspiration from software suites like AllenNLP [8], Detectron [9], and ParlAI [16], which aim to break similar barriers in other machine learning domains such as natural language processing and computer vision.

Framework Design: In Pythia (see Figure 1a), a central trainer loads a bootstrapper, which sets up the components required for training. The bootstrapper builds a model based on the network configuration provided by the researcher. For loading the data, the bootstrapper instantiates a task loader, which can load multiple tasks based on the configuration. Pythia works on a plugin-based registry in which tasks and models register themselves to a particular key in the registry mapping. Furthermore, datasets register themselves to one or more tasks. This registry enables dynamic loading of models and tasks at runtime based on the configuration (a minimal illustrative sketch of this pattern is given after the footnotes below). A task first builds, if not already present, and then loads the datasets registered to it. A dataset is responsible for its own metrics, logging, and loss function, thus keeping the trainer agnostic to the details of the data. See Figure 1b for a tree overview of tasks (second level) and the datasets registered to them.

¹ A preliminary version of Pythia (v0.2) is available at https://github.com/facebookresearch/pythia. Note that v0.3, which is described in this abstract, will be open-sourced soon.
² We plan to release pre-trained models for these implementations for easy comparisons in v0.3.
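To make the plugin-based registry and configuration-driven loading described above concrete, the following is a minimal sketch in Python. It is not Pythia's actual API: the names Registry, register_task, register_model, and build_from_config, as well as the keys "vqa" and "example_model", are assumptions introduced purely for illustration. The sketch only shows the general pattern by which tasks and models register themselves under string keys so that the trainer can resolve them at runtime from a configuration.

class Registry:
    """Maps string keys to task and model classes so they can be resolved at runtime."""

    def __init__(self):
        self._tasks = {}
        self._models = {}

    def register_task(self, key):
        def decorator(cls):
            self._tasks[key] = cls
            return cls
        return decorator

    def register_model(self, key):
        def decorator(cls):
            self._models[key] = cls
            return cls
        return decorator

    def get_task(self, key):
        return self._tasks[key]

    def get_model(self, key):
        return self._models[key]


registry = Registry()


@registry.register_task("vqa")  # hypothetical key, for illustration only
class VQATask:
    def __init__(self, config):
        # In the platform described above, the task would build (if needed) and
        # load the datasets registered to it; each dataset owns its metrics,
        # logging, and loss function.
        self.config = config


@registry.register_model("example_model")  # hypothetical key, for illustration only
class ExampleModel:
    def __init__(self, config):
        self.config = config


def build_from_config(config):
    """Resolve task and model classes from the string keys in the researcher's config."""
    task = registry.get_task(config["task"])(config)
    model = registry.get_model(config["model"])(config)
    return task, model


task, model = build_from_config({"task": "vqa", "model": "example_model"})

Because registration happens at import time through decorators, adding a new model or task amounts to writing a new class, registering it under a key, and referring to that key in the configuration; the trainer itself does not need to change.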
References

[1] José M. F. Moura, et al. Visual Dialog. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[2] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.
[3] Xinlei Chen, et al. Pythia v0.1: the Winning Entry to the VQA Challenge 2018. arXiv, 2018.
[4] Richard Socher, et al. The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv, 2018.
[5] Byoung-Tak Zhang, et al. Bilinear Attention Networks. NeurIPS, 2018.
[6] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. BlackboxNLP@EMNLP, 2018.
[7] Luke S. Zettlemoyer, et al. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv, 2018.
[8] Christopher Joseph Pal, et al. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. ICLR, 2018.
[9] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations. NAACL, 2018.
[10] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training. 2018.
[11] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and VQA. arXiv, 2017.
[12] Jason Weston, et al. ParlAI: A Dialog Research Software Platform. EMNLP, 2017.
[13] Holger Schwenk, et al. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. EMNLP, 2017.
[14] Li Fei-Fei, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 2016.
[16] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2014.
[17] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation. EMNLP, 2014.
[18] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks. NIPS, 2014.
[19] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP, 2014.
[20] Pietro Perona, et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.
[21] Jeffrey P. Bigham, et al. VizWiz: nearly real-time answers to visual questions. W4A, 2010.