A new data science research program: evaluation, metrology, standards, and community outreach

This article examines foundational issues in data science, including current challenges, basic research questions, and expected advances, as the basis for a new data science research program (DSRP) and associated data science evaluation (DSE) series, introduced by the National Institute of Standards and Technology (NIST) in the fall of 2015. The DSRP is designed to facilitate and accelerate research progress in data science and consists of four components: evaluation and metrology, standards, compute infrastructure, and community outreach. A key part of the evaluation and metrology component is the DSE series, which aims to address logistical and evaluation design challenges while providing rigorous measurement methods and emphasizing generalizability over domain- and application-specific approaches. Toward that end, each year the DSE will consist of multiple research tracks and will encourage the application of tasks that span these tracks. The evaluations are intended to facilitate research efforts and collaboration, leverage shared infrastructure, and effectively address crosscutting challenges faced by diverse data science communities. The research tracks will be championed by members of the data science community, with the goal of enabling rigorous comparison of approaches through common tasks, datasets, metrics, and shared research challenges. The tracks will make it possible to measure several different data science technologies across a wide range of fields and will address computing infrastructure, standards for an interoperability framework, and domain-specific examples. This article also summarizes lessons learned from the DSE series pre-pilot, held in the fall of 2015.
