
Visual question answering (VQA) demands simultaneous comprehension of both the visual content of an image and a natural language question. In some cases, the reasoning also needs common sense or general knowledge, which usually appears in the form of text. Current methods jointly embed the visual information and the textual features into the same space. However, modeling the complex interactions between the two modalities is not an easy task. Instead of struggling with multimodal feature fusion, in this paper we propose to unify all the input information in natural language, thereby converting VQA into a machine reading comprehension problem. With this transformation, our method can not only tackle VQA datasets that focus on observation-based questions, but can also be naturally extended to knowledge-based VQA, which requires exploring a large-scale external knowledge base. It is a step towards exploiting large volumes of text and natural language processing techniques to address the VQA problem. Two types of models are proposed, to deal with open-ended VQA and multiple-choice VQA respectively. We evaluate our models on three VQA benchmarks. Their performance is comparable with the state of the art, which demonstrates the effectiveness of the proposed method.
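The idea of expressing every input in natural language can be illustrated with a minimal sketch. It is not the model proposed in the paper: the region captions are assumed to come from some image description step, the optional facts stand in for entries retrieved from a knowledge base, and the function names `build_passage` and `pick_sentence` are hypothetical. A simple word-overlap heuristic takes the place of a trained reading comprehension model.

```python
from typing import Iterable, List


def build_passage(captions: Iterable[str], facts: Iterable[str] = ()) -> str:
    """Join image descriptions (and optional knowledge-base facts) into one text passage."""
    sentences: List[str] = []
    for text in list(captions) + list(facts):
        text = text.strip().rstrip(".")
        if text:
            sentences.append(text + ".")
    return " ".join(sentences)


def pick_sentence(passage: str, question: str) -> str:
    """Toy stand-in for a reading comprehension model.

    Returns the passage sentence that shares the most words with the question.
    """
    q_words = set(question.lower().strip("?").split())
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))


# Hypothetical captions, written by hand for illustration only.
captions = [
    "a man riding a brown horse",
    "green grass on the field",
    "a white fence behind the horse",
]
passage = build_passage(captions)
answer_sentence = pick_sentence(passage, "What is the man riding?")
print(passage)
print(answer_sentence)
```

Once the image is represented as a passage, the question is answered from text alone, so knowledge-based questions can be handled in this sketch by passing extra sentences through the `facts` argument.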

P. Ricker | P. Wessels | I. Nardecchia | L. Naticchioni | G. Nelemans | A. Neunzert | S. Nissanke | A. Nitz | F. Nocera | J. Oberling | F. Ohme | M. Oliver | P. Oppermann | R. Oram | H. Overmier | B. Owen | C. Pankow | F. Pannarale | F. Paoletti | A. Paoli | D. Pascucci | A. Pasqualetti | R. Passaquieti | D. Passuello | B. Pearlstone | M. Pedraza | R. Pedurand | L. Pekowsky | A. Pele | A. Perreca | M. Phelps | F. Piergiovanni | V. Pierro | G. Pillant | L. Pinard | M. Pitkin | R. Poggiani | A. Post | J. Prasad | V. Predoi | T. Prestegard | M. Punturo | P. Puppo | M. Pürrer | E. Quintero | R. Quitzow-James | F. Raab | D. Rabeling | H. Radkins | S. Raja | M. Rakhmanov | P. Rapagnani | M. Razzano | J. Read | T. Regimbau | L. Rei | S. Reid | F. Ricci | K. Riles | R. Robie | F. Robinet | L. Rolland | J. Rollins | R. Romano | J. Romie | S. Rowan | A. Rüdiger | P. Ruggi | S. Sachdev | T. Sadecki | M. Saleem | F. Salemi | A. Samajdar | L. Sammut | E. Sanchez | B. Sassolas | O. Sauter | A. Sawadsky | A. Schönbeck | E. Schreiber | D. Schuette | J. Scott | D. Sellers | V. Sequino | A. Sergeev | Y. Setyawati | D. Shaddock | B. Shapiro | K. Siellez | D. Sigg | B. Slagmolen | B. Sorazu | F. Sorrentino | T. Souradeep | J. Steinlechner | S. Steinlechner | D. Steinmeyer | G. Stratta | D. Talukder | R. Taylor | T. Theeg | E. Thrane | S. Tiwari | V. Tiwari | D. Töyrä | F. Travasso | L. Trozzo | S. Vass | E. Seidel | J. Neilson | T. Nelson | M. Nery | H. Ohta | B. O'Reilly | R. Ormiston | L. F. Ortega | S. Ossokine | G. Pagano | A. Parida | M. Patil | B. Patricelli | C. Pedersen | H. Pfeiffer | M. Pirello | P. Popolizio | G. Pratten | L. Prokhorov | O. Puncken | B. Rajbhandari | K. Ramirez | A. Ramos-Buades | J. Rana | W. Ren | M. Rizzo | C. Romel | G. Rutins | K. Ryan | L. Salconi | N. Sanchis-Gual | B. Schulte | S. Schwalbe | A. Sengupta | L. Shao | H. Shen | A. Singhal | S. Somala | K. Staats | A. Strunk | J. Suresh | C. Talbot | F. Thies | Z. Tornasi | V. Varma | K. Venkateswara | Gautam Venugopalan | F. Vetrano | J. Watchi | I. Pinto | K. Napier | L. Nevin | J. Page | L. Perri | B. Smith | D. Reitze | T. Robson | S. Tewari | B. Steltner | W. Parker | N. Robertson | J. Scheuer | P. Schmidt | B. Schutz | A. Sintes | L. Possenti | V. Quetschke | M. Sakellariadou | A. Rocchi | M. Steinke | M. Weinert | A. Viceré | D. Rosińska | A. Singh | S. Oh | S. -. Oh | R. Taylor
