We Need to Talk About Data: The Importance of Data Readiness in Natural Language Processing

In this paper, we identify the state of data as an important reason for failure in applied Natural Language Processing (NLP) projects. We argue that there is a gap between academic research in NLP and its application to problems outside academia, and that this gap is rooted in poor mutual understanding between academic researchers and their non-academic peers who seek to apply research results to their operations. To foster the transfer of research results from academia to non-academic settings, and the corresponding influx of requirements back to academia, we propose a method, based on Data Readiness Levels (Lawrence, 2017), for improving communication between researchers and external stakeholders regarding the accessibility, validity, and utility of data. While still in its infancy, the method has been iterated on and applied in multiple innovation and research projects carried out with stakeholders in both the private and public sectors. Finally, we invite researchers and practitioners to share their experiences, and thus contribute to a body of work aimed at raising awareness of the importance of data readiness for NLP.
