Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects

Human vision benefits greatly from information about object sizes. The role of size in several visual reasoning tasks has been thoroughly explored in human perception and cognition. However, the impact of object size information is yet to be determined in AI. We postulate that this is mainly due to the lack of a comprehensive repository of size information. In this paper, we introduce a method to automatically infer object sizes, leveraging visual and textual information from the web. By maximizing the joint likelihood of textual and visual observations, our method learns reliable relative size estimates with no explicit human supervision. We introduce the relative size dataset and show that our method outperforms competitive textual and visual baselines in reasoning about size comparisons.
