Flagging incorrect nucleotide sequence reagents in biomedical papers: To what extent does the leading publication format impede automatic error detection?

In an idealised vision of science, the scientific literature is error-free: errors reported during peer review are corrected prior to publication, and further research builds new knowledge on the body of literature. In practice, however, errors pass through peer review, and only in a minority of cases do errata or retractions follow. Automated screening software can be applied to detect errors in manuscripts and publications. The contribution of this paper is twofold. First, we designed the erroneous reagent checking (ERC) benchmark to assess the accuracy of fact-checkers that screen biomedical publications for dubious mentions of nucleotide sequence reagents. It comes with a test collection of 1679 nucleotide sequence reagents curated by biomedical experts. Second, we benchmarked our own screening software, Seek&Blastn, on three input formats to assess the extent of performance loss when operating on various publication formats. Our findings stress the superiority of markup formats (a 79% detection rate on XML and HTML) over the prevailing PDF format (a 69% detection rate at most) for this error-flagging task. This is the first published baseline on detecting errors in reagents reported in biomedical scientific publications. The ERC benchmark is designed to facilitate the development and validation of software bricks that enhance the reliability of the peer review process.
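To make the screening task concrete, the sketch below shows a naive first pass for the kind of extraction such a fact-checker must perform: locating candidate nucleotide sequence reagents in publication text. This is a minimal illustration, not the actual Seek&Blastn pipeline, and the example snippet, sequence, and gene name `GENEX` are hypothetical; the real tool additionally parses the surrounding claim and verifies the sequence against its stated target via BLAST.

```python
import re

def find_candidate_sequences(text, min_len=10):
    """Return substrings that look like nucleotide sequence reagents.

    Naive heuristic: any run of A/C/G/T characters (case-insensitive)
    of at least min_len bases. A real screener would also capture the
    claimed target from context and cross-check it with a BLAST query.
    """
    pattern = re.compile(r"[ACGTacgt]{%d,}" % min_len)
    return [m.group(0).upper() for m in pattern.finditer(text)]

# Hypothetical sentence in the style of a knockdown experiment report.
snippet = ("The shRNA targeting GENEX "
           "(5'-GCTCAGAACATCCTCAGAA-3') was transfected into the cells.")
print(find_candidate_sequences(snippet))  # ['GCTCAGAACATCCTCAGAA']
```

Extraction quality is exactly where the publication format matters: a run like the one above survives XML/HTML markup intact, but PDF-to-text conversion can split or garble it, which is consistent with the detection-rate gap reported in the abstract.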
