Evaluation-as-a-Service for the Computational Sciences

Evaluation in empirical computer science is essential to demonstrate progress and to assess the technologies developed. Several research domains, such as information retrieval, have long relied on systematic evaluation to measure progress: here, the Cranfield paradigm of creating shared test collections, defining search tasks, and collecting ground truth for these tasks has persisted to this day. In recent years, however, several new challenges have emerged that do not fit this paradigm well: extremely large data sets, confidential data sets as found in the medical domain, and rapidly changing data sets as often encountered in industry. Crowdsourcing has also changed the way in which industry approaches problem-solving, with companies now organizing challenges and handing out monetary awards to incentivize people to work on their problems, particularly in the field of machine learning. This article is based on discussions at a workshop on Evaluation-as-a-Service (EaaS). EaaS is the paradigm of not providing data sets to participants to work on locally, but instead keeping the data central and allowing access via Application Programming Interfaces (APIs), Virtual Machines (VMs), or other means of shipping executables to the data. The objectives of this article are to summarize and compare the current approaches and to consolidate the experiences gained with them in order to outline the next steps of EaaS, particularly toward sustainable research infrastructures. The article summarizes several existing EaaS approaches and analyzes their usage scenarios as well as their advantages and disadvantages. The many factors influencing EaaS are summarized, as is the surrounding environment in terms of the motivations of the various stakeholders, from funding agencies to challenge organizers, researchers and participants, to industry interested in supplying real-world problems for which it requires solutions. EaaS addresses many problems of the current research environment, where data sets are often not accessible to many researchers and executables of published tools are equally often unavailable, making the reproducibility of results impossible. EaaS, in contrast, creates reusable and citable data sets as well as available executables. Many challenges remain, but such a framework for research can also foster more collaboration between researchers, potentially increasing the speed of obtaining research results.
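
The core access pattern of EaaS, submitting system output or executables to a central service rather than downloading the test data, can be illustrated with a minimal sketch. The Python example below is purely illustrative: the endpoint URL, the submit_run function, the payload fields, and the returned metrics are hypothetical assumptions and do not correspond to the interface of any specific evaluation campaign.

# Minimal sketch of the EaaS access pattern: the data set stays on the
# organizer's infrastructure and participants interact with it via an API.
# All names and fields here are hypothetical, for illustration only.

import json
import urllib.request

EAAS_ENDPOINT = "https://evaluation.example.org/api/v1"  # hypothetical central service


def submit_run(task_id: str, system_output: dict) -> dict:
    """Send a system's output for a task to the central service and
    receive evaluation scores, without holding the test data locally."""
    request = urllib.request.Request(
        f"{EAAS_ENDPOINT}/tasks/{task_id}/runs",
        data=json.dumps(system_output).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # e.g. {"run_id": "...", "map": 0.31, "p@10": 0.42}
        return json.load(response)


if __name__ == "__main__":
    # A participant submits ranked results for a hypothetical retrieval task.
    scores = submit_run("adhoc-2018", {"query_1": ["doc_17", "doc_3", "doc_42"]})
    print(scores)

The same idea generalizes to the VM and executable-shipping variants mentioned above: instead of posting system output, the participant uploads a container or binary that the central infrastructure runs against the protected data and returns only the scores.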
