Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and news often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles, written by students from across the country, we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.
