Detecting Computer-Generated Text Using Fluency and Noise Features

Computer-generated text plays a pivotal role in various applications, but its quality is generally much lower than that of human-generated text. The use of artificially generated “machine text” can thus negatively affect practical applications such as website generation and text corpus collection, so a method for distinguishing computer-generated from human-generated text is needed. Previous methods extract fluency features from a limited internal corpus and use them to identify generated text. We have extended this approach to also estimate fluency using an enormous external corpus. We have also developed a method for extracting and distinguishing the noise characteristically produced by people and by machines. For example, people frequently use spoken noise words (2morrow, wanna, etc.) and misspelled ones (comin, hapy, etc.), while machines frequently generate incorrect expressions (such as untranslated phrases). A method combining these fluency and noise features was evaluated using 1000 original English messages and 1000 artificial English messages machine-translated from Spanish. The results show that the combined method achieved higher accuracy (80.35%) and a lower equal error rate (19.44%) than a state-of-the-art method based on syntactic parsing. Moreover, experiments using texts in other languages produced similar results, demonstrating that our proposed method works consistently across various languages.
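
As a rough illustration of the idea (a minimal sketch, not the authors' implementation), the Python snippet below scores a message by combining a simple corpus-based fluency proxy (the fraction of the message's bigrams that appear in a reference corpus) with a human-noise ratio based on spoken and misspelled word forms. The class name `FluencyNoiseScorer`, the `HUMAN_NOISE` word list, the weighting, and the decision threshold are all hypothetical stand-ins for the paper's actual features and classifier.

```python
from collections import Counter
from typing import Iterable

# Hypothetical examples of human-style noise: spoken forms and misspellings
# like those cited in the abstract (2morrow, wanna, comin, hapy).
HUMAN_NOISE = {"2morrow", "wanna", "comin", "hapy", "gonna", "u"}


def bigrams(tokens: list[str]) -> Iterable[tuple[str, str]]:
    """Yield adjacent token pairs from a tokenized sentence."""
    return ((tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1))


class FluencyNoiseScorer:
    """Toy scorer combining a corpus-based fluency proxy with a
    human-noise-word ratio; weights and threshold are arbitrary."""

    def __init__(self, corpus_sentences: list[list[str]]):
        # Count bigrams in the (internal or external) reference corpus.
        self.corpus_bigrams = Counter()
        for sentence in corpus_sentences:
            self.corpus_bigrams.update(bigrams(sentence))

    def fluency(self, tokens: list[str]) -> float:
        """Fraction of the message's bigrams seen in the reference corpus."""
        pairs = list(bigrams(tokens))
        if not pairs:
            return 0.0
        return sum(1 for p in pairs if p in self.corpus_bigrams) / len(pairs)

    def human_noise_ratio(self, tokens: list[str]) -> float:
        """Fraction of tokens that look like human-style noise words."""
        return sum(1 for t in tokens if t.lower() in HUMAN_NOISE) / max(len(tokens), 1)

    def looks_human(self, tokens: list[str], weight: float = 0.5) -> bool:
        """Human text tends to be fluent or to contain human-style noise;
        machine-translated text is often neither. Threshold is illustrative."""
        score = weight * self.fluency(tokens) + (1 - weight) * self.human_noise_ratio(tokens)
        return score > 0.3


# Illustrative usage with a tiny made-up corpus:
scorer = FluencyNoiseScorer([["see", "you", "tomorrow"], ["i", "am", "coming", "home"]])
print(scorer.looks_human(["c", "u", "2morrow"]))  # noisy but human-like -> True
```

A real system would replace the bigram-hit ratio with smoothed n-gram probabilities from a large corpus and learn the feature weights from labeled data; the sketch only shows how fluency and noise signals can be combined into one decision.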