Detection of automatically generated texts

Automatically generated text has been used on numerous occasions and with distinct intentions. It ranges from generated comments in an online discussion to far more mischievous tasks, such as manipulating bibliographic information. This thesis therefore first introduces different methods of generating free text on a given topic and describes how such text can be used. It then addresses several research questions. The first is how best to detect a fully generated document. The second takes this one step further and asks whether a few sentences or a short paragraph of automatically generated text can be detected; to this end, we propose a new method that computes sentence similarity from grammatical structure. The last question is how to detect an automatically generated document without any samples, which covers the case of a new generator or a generator from which samples cannot be collected.

This thesis also deals with the industrial aspect of development. A simple overview of the publishing workflow of a high-profile publisher is presented, followed by an analysis of how best to incorporate our detection method into the production workflow.

In conclusion, this thesis sheds light on several important research questions about the possibility of detecting automatically generated texts in different settings. Besides the research aspect, substantial engineering work in a real-life industrial environment is carried out to demonstrate the importance of real applications alongside theoretical research.
