Evaluating Multimedia and Language Tasks

Evaluating information access tasks, including textual and multimedia search, question answering, and understanding, has been the core mission of NIST's Retrieval Group since 1989. The TRECVID Evaluations of Multimedia Access began in 2001 with the goal of driving content-based search technology for multimedia, just as its progenitor, the Text Retrieval Conference (TREC), did for text and the web.
