Rethinking Object Detection in Retail Stores

The convention standard for object detection uses a bounding box to represent each individual object instance. However, it is not practical in the industry-relevant applications in the context of warehouses due to severe occlusions among groups of instances of the same categories. In this paper, we propose a new task, ie, simultaneously object localization and counting, abbreviated as Locount, which requires algorithms to localize groups of objects of interest with the number of instances. However, there does not exist a dataset or benchmark designed for such a task. To this end, we collect a large-scale object localization and counting dataset with rich annotations in retail stores, which consists of 50,394 images with more than 1.9 million object instances in 140 categories. Together with this dataset, we provide a new evaluation protocol and divide the training and testing subsets to fairly evaluate the performance of algorithms for Locount, developing a new benchmark for the Locount task. Moreover, we present a cascaded localization and counting network as a strong baseline, which gradually classifies and regresses the bounding boxes of objects with the predicted numbers of instances enclosed in the bounding boxes, trained in an end-to-end manner. Extensive experiments are conducted on the proposed dataset to demonstrate its significance and the analysis discussions on failure cases are provided to indicate future directions. Dataset is available at this https URL.

[1]  Michele Merler,et al.  Recognizing Groceries in situ Using in vitro Training Data , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[3]  Markus Ulrich,et al.  MVTec D2S: Densely Segmented Supermarket Dataset , 2018, ECCV.

[4]  Lei Yang,et al.  RPC: A Large-Scale Retail Product Checkout Dataset , 2019, ArXiv.

[5]  Qi Tian,et al.  Data Priming Network for Automatic Check-Out , 2019, ACM Multimedia.

[6]  Tal Hassner,et al.  Precise Detection in Densely Packed Scenes , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Stephen Lin,et al.  RepPoints: Point Set Representation for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Asaf Tzadok,et al.  Fine-Grained Recognition of Thousands of Object Categories with Single-Example Training , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  F. Khan,et al.  Object Counting and Instance Segmentation With Image-Level Supervision , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Andrew Zisserman,et al.  Learning To Count Objects in Images , 2010, NIPS.

[11]  Ramprasaath R. Selvaraju,et al.  Counting Everyday Objects in Everyday Scenes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Mark W. Schmidt,et al.  Where are the Blobs: Counting by Localization with Point Supervision , 2018, ECCV.

[13]  Alan Fern,et al.  Person count localization in videos from noisy foreground and detections , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Xiaogang Wang,et al.  Cross-scene crowd counting via deep convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Hei Law,et al.  CornerNet: Detecting Objects as Paired Keypoints , 2018, ECCV.

[16]  Hao Chen,et al.  FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Anil A. Bharath,et al.  Small Hand-held Object Recognition Test (SHORT) , 2014, IEEE Winter Conference on Applications of Computer Vision.

[18]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[19]  Joost van de Weijer,et al.  Leveraging Unlabeled Data for Crowd Counting by Learning to Rank , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Saturnino Maldonado-Bascón,et al.  Extremely Overlapping Vehicle Counting , 2015, IbPRIA.

[21]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Gül Varol,et al.  Toward retail product recognition on grocery shelves , 2015, International Conference on Graphic and Image Processing.

[23]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[24]  Shifeng Zhang,et al.  Single-Shot Refinement Neural Network for Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Yu Hao,et al.  Take Goods from Shelves: A Dataset for Class-Incremental Object Detection , 2019, ICMR.

[26]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[27]  José M. F. Moura,et al.  FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Jiri Matas,et al.  Evaluating Colour-Based Object Recognition Algorithms Using the SOIL-47 Database , 2002 .

[29]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[30]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[31]  Wolfram Burgard,et al.  The Freiburg Groceries Dataset , 2016, ArXiv.

[32]  Jacques Wainer,et al.  Automatic fruit and vegetable classification from images , 2010 .

[33]  Shenghua Gao,et al.  Single-Image Crowd Counting via Multi-Column Convolutional Neural Network , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Christian Floerkemeier,et al.  Recognizing Products: A Per-exemplar Multi-label Image Classification Approach , 2014, ECCV.

[37]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.