Paper information - Detection of automatically generated texts

Automatically generated text has been used on numerous occasions and with distinct intentions, ranging from generated comments in an online discussion to far more mischievous tasks such as manipulating bibliographic information. This thesis first introduces different methods of generating free text that resembles a certain topic and shows how such text can be used. It then tackles several research questions. The first is how best to detect a fully generated document. The second takes this one step further and addresses the possibility of detecting a couple of sentences or a small paragraph of automatically generated text, by proposing a new method that calculates sentence similarity from grammatical structure. The last is how to detect an automatically generated document without any samples, which covers the case of a new generator or of a generator from which samples cannot be collected. This thesis also deals with the industrial aspect of development. A simple overview of the publishing workflow of a high-profile publisher is presented, followed by an analysis of how best to incorporate our detection method into the production workflow. In conclusion, this thesis sheds light on several important research questions about the possibility of detecting automatically generated text in different settings. Beyond the research aspect, substantial engineering work in a real-life industrial environment is carried out to demonstrate the importance of real applications alongside theoretical research.

Minh Tien Nguyen | M. T. Nguyen
