The YLI-MED Corpus: Characteristics, Procedures, and Plans

The YLI Multimedia Event Detection corpus is a public-domain index of videos with annotations and computed features, specialized for research in multimedia event detection (MED), i.e., automatically identifying what's happening in a video by analyzing the audio and visual content. The videos indexed in the YLI-MED corpus are a subset of the larger YLI feature corpus, which is being developed by the International Computer Science Institute and Lawrence Livermore National Laboratory based on the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The videos in YLI-MED are categorized as depicting one of ten target events, or no target event, and are annotated for additional attributes like language spoken and whether the video has a musical score. The annotations also include degree of annotator agreement and average annotator confidence scores for the event categorization of each video. Version 1.0 of YLI-MED includes 1,823 "positive" videos that depict the target events and 48,138 "negative" videos, as well as 177 supplementary videos that are similar to event videos but are not positive examples. Our goal in producing YLI-MED is to be as open about our data and procedures as possible. This report describes the procedures used to collect the corpus; gives detailed descriptive statistics about the corpus makeup (and how video attributes affected annotators' judgments); discusses possible biases in the corpus introduced by our procedural choices and compares it with the most similar existing dataset, TRECVID MED's HAVIC corpus; and gives an overview of our future plans for expanding the annotation effort.
