Detection of computer generated papers in scientific literature

Meaningless computer-generated scientific texts can be used in several ways. For example, they have allowed Ike Antkare to become one of the most highly cited scientists of the modern world. Such fake publications are also appearing in real scientific conferences and, as a result, in bibliographic services (Scopus, ISI-Web of Knowledge, Google Scholar, ...). Recently, more than 120 papers were withdrawn from the subscription databases of two high-profile publishers, IEEE and Springer, because they had been computer generated with the SCIgen software. This software, based on a Probabilistic Context Free Grammar (PCFG), was designed to randomly generate computer science research papers. Together with PCFG, Markov Chains (MC) are the main ways to generate meaningless texts. This paper presents the main characteristics of texts generated by PCFG and MC. For the time being, PCFG generators are quite easy to spot automatically, using intertextual distance combined with automatic clustering, because these generators behave like authors with specific features such as a very low vocabulary richness and unusual sentence structures. This shows that quantitative tools are effective for characterizing the originality (or banality) of authors' language.
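To make the notion of intertextual distance more concrete, here is a minimal Python sketch. It assumes a simplified form of the measure: the word frequencies of the longer text are scaled down to the length of the shorter one, and the absolute frequency differences are summed and normalised so that the result lies between 0 (identical word distributions) and 1 (no word in common). The function names `tokenize` and `intertextual_distance`, and the crude regular-expression tokenizer, are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def intertextual_distance(text_a, text_b):
    """Relative distance in [0, 1] between the word frequencies of two texts.

    Simplified sketch: the longer text's frequencies are scaled to the
    length of the shorter text before the absolute differences are summed.
    """
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one word")
    # Make tokens_a the shorter text so that the longer one is scaled down.
    if len(tokens_a) > len(tokens_b):
        tokens_a, tokens_b = tokens_b, tokens_a
    small, large = Counter(tokens_a), Counter(tokens_b)
    n_small, n_large = len(tokens_a), len(tokens_b)
    scale = n_small / n_large
    vocabulary = set(small) | set(large)
    # Counter returns 0 for words that are absent from a text.
    total_gap = sum(abs(small[w] - large[w] * scale) for w in vocabulary)
    # Both (scaled) texts have n_small tokens, hence the 2 * n_small.
    return total_gap / (2 * n_small)
```

In a detection setting, one would presumably compute this distance for every pair of documents in a corpus and pass the resulting matrix to a hierarchical clustering routine. Texts produced by the same generator would then be expected to fall into one tight cluster, since they share a small vocabulary.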
