Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Large generative language models such as GPT-2 are well known for their ability to generate text, as well as for their utility in supervised downstream tasks via fine-tuning. Our work is twofold: first, we demonstrate via human evaluation that classifiers trained to discriminate between human- and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low-quality content without any explicit quality supervision. This enables fast bootstrapping of quality indicators in a low-resource setting. Second, curious to understand the prevalence and nature of low-quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
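
A minimal sketch of the first idea, not the authors' actual pipeline: score a page with an off-the-shelf human-vs-machine text detector and treat its "human" probability as a rough page-quality proxy. The checkpoint name ("roberta-base-openai-detector") and its "Real"/"Fake" label names are assumptions about what is available through Hugging Face transformers; substitute whatever detector you have.

    # Assumed: a pretrained human-vs-machine detector available via transformers.
    from transformers import pipeline

    detector = pipeline("text-classification", model="roberta-base-openai-detector")

    def quality_score(page_text: str) -> float:
        """Return a score in [0, 1]; higher means the text looks more human-written."""
        # Truncate to the model's maximum input length to avoid tokenizer errors.
        result = detector(page_text, truncation=True)[0]
        prob = result["score"]
        # Labels are assumed to be "Real" (human) and "Fake" (machine);
        # adjust to match the label names in your detector's config.
        return prob if result["label"] == "Real" else 1.0 - prob

    if __name__ == "__main__":
        print(quality_score("The quick brown fox jumps over the lazy dog."))

In this framing, pages whose text the detector finds machine-like (low score) are flagged as candidate low-quality content, which is what allows bootstrapping a quality signal without any quality-labeled data.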
