M2-VISD: A Visual Intelligence Evaluation Dataset Based on Virtual Scenes

Most existing visual intelligence evaluation datasets, such as ImageNet and ActivityNet, share a common property: they consist of many images or videos captured from the real world. This property makes these datasets suitable for practical applications, but it limits the diversity and interactivity of the evaluation environments they can provide. In particular, currently available datasets do not systematically provide data at different scales and from different viewing angles. To address these problems, this paper constructs a multi-angle, multi-scale dataset on the UE4 platform, which can evaluate algorithm performance comprehensively along both the scale and angle dimensions. Our experiments show that algorithms' detection performance varies greatly across different scales and angles. In particular, on the scale data, performance is poor when the distance between the camera and the object is less than 50 cm or greater than 3200 cm; on the angle data, performance is also poor when the camera's pitch angle is 18 degrees or 0 degrees. These results help guide the correct evaluation of algorithm performance and provide support for the innovation and optimization of detection algorithms.
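The evaluation protocol described above, slicing detection results by camera distance and pitch angle, can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the per-image records, the bucket boundaries (taken from the 50 cm and 3200 cm thresholds reported in the abstract), and the pitch angles are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical per-image records: (camera_distance_cm, pitch_deg, detected).
# In the real dataset these would come from rendering in UE4 and running a detector.
records = [
    (40, 54, False), (100, 54, True), (800, 36, True),
    (1600, 18, True), (3200, 18, False), (6400, 0, False),
]

def bucket_distance(d_cm):
    """Coarse scale buckets around the thresholds the abstract reports:
    performance drops below 50 cm and beyond 3200 cm."""
    if d_cm < 50:
        return "near"
    if d_cm > 3200:
        return "far"
    return "mid"

# Detection rate per (scale bucket, pitch angle) cell.
hits = defaultdict(int)
totals = defaultdict(int)
for dist, pitch, detected in records:
    key = (bucket_distance(dist), pitch)
    totals[key] += 1
    hits[key] += int(detected)

rates = {k: hits[k] / totals[k] for k in totals}
print(rates)
```

Reporting a per-cell detection rate rather than a single aggregate number is what lets this kind of dataset expose where an algorithm degrades, rather than only how well it does on average.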
