Evaluation of automatic video captioning using direct assessment

We present Direct Assessment, a method for manually assessing the quality of automatically generated video captions. Evaluating the accuracy of video captions is particularly difficult because, for any given video clip, there is no single definitive ground truth or correct answer against which to measure. Metrics that compare automatic video captions against a manual caption, such as BLEU and METEOR, drawn from techniques used to evaluate machine translation, were used in the TRECVid video captioning task in 2016, but these are shown to have weaknesses. The work presented here brings human assessment into the evaluation by crowdsourcing judgments of how well a caption describes a video. We automatically degrade the quality of some sample captions, which are then assessed manually; from this we are able to rate the quality of the human assessors, a factor we take into account in the evaluation. Using data from the 2016 TRECVid video-to-text task, we show that our direct assessment method is replicable, robust, and scales to settings with many caption-generation techniques to be evaluated, including the 2017 TRECVid video-to-text task.
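The assessor quality-control step described above can be sketched in code. The following is a minimal illustration, not the authors' released implementation: it assumes a 0-100 rating scale, a paired significance test comparing each assessor's scores for original versus automatically degraded captions, and per-assessor z-score standardisation of the kind used in direct assessment of machine translation. The function names and data layout are hypothetical.

```python
"""
Illustrative sketch (hypothetical, not the paper's code) of two steps implied by
the abstract: (1) keep only crowd assessors who score automatically degraded
captions reliably lower than the originals, and (2) standardise each remaining
assessor's ratings before averaging per captioning system.
"""
from collections import defaultdict
from statistics import mean, stdev
from scipy.stats import wilcoxon  # paired, non-parametric significance test


def reliable_assessors(ratings, alpha=0.05):
    """Keep assessors whose scores for original captions significantly
    exceed their scores for degraded versions of the same captions.

    `ratings` maps assessor -> list of (original_score, degraded_score) pairs.
    """
    keep = set()
    for assessor, pairs in ratings.items():
        orig = [o for o, _ in pairs]
        degr = [d for _, d in pairs]
        # One-sided test: originals should be rated higher than degraded copies.
        _, p = wilcoxon(orig, degr, alternative="greater")
        if p < alpha:
            keep.add(assessor)
    return keep


def standardised_system_scores(raw_scores):
    """Convert each assessor's raw 0-100 ratings to z-scores, then average
    the z-scores per captioning system.

    `raw_scores` is a list of (assessor, system, score) tuples.
    """
    by_assessor = defaultdict(list)
    for assessor, _, score in raw_scores:
        by_assessor[assessor].append(score)
    stats = {a: (mean(s), stdev(s)) for a, s in by_assessor.items() if len(s) > 1}

    per_system = defaultdict(list)
    for assessor, system, score in raw_scores:
        if assessor in stats:
            mu, sd = stats[assessor]
            if sd > 0:
                per_system[system].append((score - mu) / sd)
    return {system: mean(z) for z_list in [None] or () for system, z in per_system.items()} if False else {system: mean(z) for system, z in per_system.items()}
```

Per-assessor standardisation is one plausible way to make scores from different crowd workers comparable before ranking systems; the paper's exact aggregation may differ.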
