Toulouse campus surveillance dataset: scenarios, soundtracks, synchronized videos with overlapping and disjoint views

In surveillance applications, humans and vehicles are the most commonly studied elements. Consequently, detecting and matching a person or a car that appears in several videos is a key problem. Many algorithms have been introduced, and a major related problem today is to evaluate and compare these algorithms precisely against a common ground truth. In this paper, our goal is to introduce a new dataset for evaluating multi-view methods. This dataset aims to pave the way for multidisciplinary approaches and applications such as 4D scene reconstruction, object identification and tracking, audio event detection, and multi-source metadata modeling and querying. To this end, we provide two sets of 25 synchronized videos with audio tracks, all depicting the same scene from multiple viewpoints, each set following a detailed scenario consisting of the comings and goings of people and cars. Every video was annotated by regularly drawing a bounding box around every moving object, with a flag indicating whether the object is fully visible or occluded, its category (human or vehicle), visual details (for example clothing types or colors), and the timestamps of its appearances and disappearances. Audio events are also annotated with a category and timestamps.
