Reproducibility in Computational Linguistics: Is Source Code Enough?

The availability of source code has been put forward as one of the most critical factors for improving the reproducibility of scientific research. This work studies trends in source code availability at major computational linguistics conferences, namely ACL, EMNLP, LREC, NAACL, and COLING. We observe positive trends, especially at conferences that actively promote reproducibility. We then conduct a reproducibility study of eight papers published at EMNLP 2021 and find that their source code releases leave much to be desired. Moving forward, we suggest that all conferences require self-contained artifacts and provide a venue for evaluating those artifacts at the time of publication. To improve the reproducibility of their work, authors can include small-scale experiments and explicit scripts that generate each reported result.
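
As a concrete illustration of the last recommendation, the following is a minimal, hypothetical sketch of an explicit "regenerate one result" script that an artifact could ship. The file names, data format, and metric are illustrative assumptions, not details taken from the paper.

    # regenerate_table1_row1.py -- illustrative sketch only; paths and fields are placeholders
    import argparse
    import json
    import random
    from pathlib import Path

    def evaluate(dataset_path: Path, seed: int) -> dict:
        """Toy stand-in for a real evaluation: fix the seed, load data, score."""
        random.seed(seed)  # pin randomness so reruns produce the same numbers
        # assumed format: a JSON list of {"prediction": ..., "label": ...} records
        examples = json.loads(dataset_path.read_text())
        correct = sum(1 for ex in examples if ex["prediction"] == ex["label"])
        return {"accuracy": correct / len(examples), "n": len(examples), "seed": seed}

    def main() -> None:
        parser = argparse.ArgumentParser(description="Regenerate Table 1, row 1 (hypothetical).")
        parser.add_argument("--data", type=Path, default=Path("data/dev_small.json"),
                            help="small-scale subset shipped with the artifact")
        parser.add_argument("--seed", type=int, default=42)
        parser.add_argument("--out", type=Path, default=Path("results/table1_row1.json"))
        args = parser.parse_args()

        metrics = evaluate(args.data, args.seed)
        args.out.parent.mkdir(parents=True, exist_ok=True)
        args.out.write_text(json.dumps(metrics, indent=2))
        print(f"Wrote {args.out}: {metrics}")

    if __name__ == "__main__":
        main()

Running the script once (python regenerate_table1_row1.py) deterministically rewrites exactly one reported number from data bundled with the artifact, which is the property that makes per-result scripts straightforward to check at artifact-evaluation time.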
