Prevalence of nonsensical algorithmically generated papers in the scientific literature

In 2014, leading publishers withdrew more than 120 nonsensical publications automatically generated with the SCIgen program. Casual observations suggested that similar problematic papers are still published and sold, without follow‐up retractions. No systematic screening has been performed, and the prevalence of such nonsensical publications in the scientific literature is unknown. Our contribution is twofold. First, we designed a detector that combs the scientific literature for grammar‐based computer‐generated papers. Applied to SCIgen, it achieves 83.6% precision. Second, we performed a scientometric study of the 243 detected SCIgen‐papers from 19 publishers. We estimate the prevalence of SCIgen‐papers to be 75 per million papers in Information and Computing Sciences. Only 19% of the 243 problematic papers were dealt with, through formal retraction (12) or silent removal (34). Publishers still serve, and sometimes sell, the remaining 197 papers without any caveat. We found evidence of citation manipulation via edited SCIgen bibliographies. This work reveals metric gaming to the point of absurdity: fraudsters publish nonsensical algorithmically generated papers featuring genuine references. It stresses the need to screen papers for nonsense before peer review and to chase citation manipulation in published papers. Overall, this is yet another illustration of the harmful effects of the pressure to publish or perish.
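The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the abstract; the helper `prevalence_per_million` is a hypothetical name introduced for illustration, and the size of the screened corpus is not given here, so no denominator is assumed for the 75-per-million estimate.

```python
# Counts taken from the abstract.
detected = 243          # SCIgen papers found by the detector
retracted = 12          # formally retracted
silently_removed = 34   # removed without notice

dealt_with = retracted + silently_removed
remaining = detected - dealt_with
share_dealt_with = dealt_with / detected

print(f"dealt with: {dealt_with} ({share_dealt_with:.1%})")
print(f"still served: {remaining}")


def prevalence_per_million(count: int, total: int) -> float:
    """Hypothetical helper: express `count` out of `total` papers per million."""
    return count * 1_000_000 / total
```

The 46 papers dealt with round to the reported 19%, and the remainder matches the 197 papers still being served.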
