Three Dimensions of Reproducibility in Natural Language Processing

Despite considerable recent attention to problems with the reproducibility of scientific research, there is a striking lack of agreement about the definition of the term. That is a problem, because the lack of a consensus definition makes it difficult to compare studies of reproducibility, and thus to have even a broad overview of the state of the issue in natural language processing. This paper proposes an ontology of reproducibility in that field. Its goal is to enhance future research and communication about the topic, as well as retrospective meta-analyses. We show that three dimensions of reproducibility, corresponding to three kinds of claims in natural language processing papers, can account for a variety of types of research reports. These dimensions are reproducibility of a conclusion, of a finding, and of a value. Three biomedical natural language processing papers by the authors of this paper are analyzed with respect to these dimensions.
