Evaluation-driven research in data science: Leveraging cross-field methodologies

While prior evaluation methodologies for data-science research have focused on efficient and effective teamwork on independent data-science problems within given fields [1], this paper argues that an enriched notion of evaluation-driven research (EDR) supports the development of methodologies and effective solutions to data-science problems across multiple fields. We adopt the view that progress in data-science research is enriched by examining a range of problems in many different areas (traffic, healthcare, finance, sports, etc.) and by developing methodologies and evaluation paradigms that span diverse disciplines, domains, problems, and tasks. A number of questions arise when one considers the multiplicity of data-science fields and the potential for cross-disciplinary “sharing” of methodologies, for example: the feasibility of generalizing problems, tasks, and metrics across domains; ground-truth considerations for different types of problems; issues related to data uncertainty in different fields; and the feasibility of enabling cross-field cooperation to encourage diversity of solutions. We posit that addressing the problems inherent in such questions provides a foundation for EDR across diverse fields. We ground our conclusions and insights in a brief preliminary study developed within the Information Access Division of the National Institute of Standards and Technology (NIST) as part of a new Data Science Research Program (DSRP). The DSRP focuses on this cross-disciplinary notion of EDR and includes a new Data Science Evaluation series to facilitate research collaboration, to leverage shared technology and infrastructure, and to further build and strengthen the data-science community.
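To make the idea of generalizing tasks and metrics across domains concrete, the following is a minimal illustrative sketch, not drawn from the DSRP itself: two hypothetical detection tasks from different fields (a traffic-incident task and a hospital-readmission task) are expressed in a common structure and scored against their ground truth with the same metric, here binary F1. All task names, labels, and data in the sketch are invented for illustration.

```python
# Illustrative sketch only: a domain-agnostic evaluation harness in which tasks
# from different fields share one representation and one scoring function.
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """A labeled evaluation task: domain name, ground-truth labels, system output."""
    domain: str
    ground_truth: List[int]  # 1 = event of interest, 0 = no event
    predictions: List[int]


def f1_score(truth: List[int], pred: List[int]) -> float:
    """Shared metric applied identically across domains (binary F1)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two hypothetical tasks from different fields, scored with the same metric.
tasks = [
    Task("traffic-incident-detection", [1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1]),
    Task("hospital-readmission-flagging", [0, 1, 1, 0, 1, 0], [0, 1, 1, 0, 0, 0]),
]

for task in tasks:
    print(f"{task.domain}: F1 = {f1_score(task.ground_truth, task.predictions):.2f}")
```

The point of the sketch is the shared interface, not the particular metric: the open questions raised above (ground truth, uncertainty, domain-specific costs) concern whether such a common representation and metric are meaningful across fields at all.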

References

[1] Nizar Habash et al., "Hybrid Natural Language Generation from Lexical Conceptual Structures," Machine Translation, 2003.

[2] Gagan Agrawal et al., "Towards methods for systematic research on big data," 2015 IEEE International Conference on Big Data (Big Data), 2015.

[3] Martial Michel et al., "The NIST data science initiative," 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015.

[4] Susan Elliott Sim et al., "Using benchmarking to advance research: a challenge to software engineering," Proceedings of the 25th International Conference on Software Engineering, 2003.

[5] Maarten Sierhuis et al., "The Fundamental Principle of Coactive Design: Interdependence Must Shape Autonomy," COIN@AAMAS&MALLOW, 2010.

[6] Alvin F. Martin et al., "NIST speaker recognition evaluation chronicles," Odyssey, 2004.

[7] David Pallett, "A look at NIST's benchmark ASR tests: past, present, and future," 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.

[8] William J. Byrne et al., "A Generative Probabilistic OCR Model for NLP Applications," NAACL, 2003.

[9] Alvin F. Martin et al., "NIST Speaker Recognition Evaluation Chronicles - Part 2," 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006.

[10] S. D. Roberts et al., "Quantifying uncertainty in medical decisions," Journal of the American College of Cardiology, 1989.

[11] Jonathan Stacks et al., "Developmental Evaluation," Health Promotion Practice, 2011.

[12] Jimmy J. Lin et al., "Evaluation-as-a-Service: Overview and Outlook," arXiv, 2015.

[13] Erhard Rahm et al., "The Scholarly Impact of CLEF (2000-2009)," CLEF, 2013.

[14] John D. Prange, "Evaluation Driven Research: The Foundation of the TIPSTER Text Program," TIPSTER, 1996.

[15] Jeffrey S. Saltz et al., "The need for new processes, methodologies and tools to support big data teams and improve big data project effectiveness," 2015 IEEE International Conference on Big Data (Big Data), 2015.

[16] Hui Xiong et al., "Clustering Validation Measures," in Data Clustering: Algorithms and Applications, 2018.

[17] Nitin Madnani et al., "Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric," WMT@EACL, 2009.

[18] Sunita Sarawagi et al., "Active Evaluation of Classifiers on Large Datasets," 2012 IEEE 12th International Conference on Data Mining, 2012.

[19] Martin Meckesheimer et al., "Automatic outlier detection for time series: an application to sensor data," Knowledge and Information Systems, 2007.

[20] J. Shapiro, "George H. Heilmeier," IEEE Spectrum, 1994.

[21] James G. Lyons et al., "Protein fold recognition using HMM-HMM alignment and dynamic programming," Journal of Theoretical Biology, 2016.

[22] Martial Michel et al., "A new data science research program: evaluation, metrology, standards, and community outreach," International Journal of Data Science and Analytics, 2016.

[23] M. Webb et al., "Quantification of modelling uncertainties in a large ensemble of climate change simulations," Nature, 2004.

[24] Douglas A. Reynolds, "Speaker and language recognition: a guided safari," Odyssey, 2008.

[25] Michael D. Buhrmester et al., "Amazon's Mechanical Turk," Perspectives on Psychological Science, 2011.

[26] Rich Caruana et al., "An empirical comparison of supervised learning algorithms," ICML, 2006.

[27] Jignesh M. Patel et al., "Big data and its technical challenges," Communications of the ACM, 2014.

[28] Martial Michel et al., "The NIST IAD Data Science Research Program," 2015.