Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan | Dallas Card | Sarah K. Dreier | Emily K. Gade | Leroy Z. Wang | Zeyu Wang | Luke Zettlemoyer | Noah A. Smith
[1] J. Gibson. A study of the status of high school newspapers in the Virginia public schools , 1961 .
[2] W. Labov. The social stratification of English in New York City , 1969 .
[3] J. Rickford. Ethnicity as a Sociolinguistic Boundary , 1985 .
[4] Penelope Eckert,et al. Jocks and Burnouts: Social Categories and Identity in the High School , 1989 .
[5] Robert DiNicola. Teaching Journalistic Style with the AP Stylebook: Beyond Fussy Rules and Dogma of ‘Correctness’ , 1994 .
[6] J. T. Irvine,et al. The boundaries of languages and disciplines : How ideologies construct difference , 1995 .
[7] Larry V. Hedges,et al. The Effect of School Resources on Student Achievement , 1996 .
[8] Sergio Paulo Benevides,et al. Silencing the past: power and the production of history , 1999 .
[9] H. Goldstein. On Boundaries , 1999 .
[10] J. Betts,et al. Equal Resources, Equal Outcomes? The Distribution of School Resources and Student Achievement in California , 2000 .
[11] Michael I. Jordan,et al. Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res.
[12] Stephanie Lindemann. Who speaks “broken English”? US undergraduates’ perceptions of non‐native English , 2005 .
[13] Aaron Halfaker,et al. Wikipedians are born, not made: a study of power editors on Wikipedia , 2009, GROUP.
[14] Gaël Varoquaux,et al. Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res.
[15] Tyler Baldwin,et al. Beyond Normalization: Pragmatics of Word Form in Text Messages , 2011, IJCNLP.
[16] F. Vultee. A PALEONTOLOGY OF STYLE , 2012 .
[17] Katherine L. Milkman,et al. What Makes Online Content Viral? , 2012 .
[18] S. Decker. The silence of the archives: business history, post-colonialism and archival ethnography , 2013 .
[19] Jacob Eisenstein,et al. What to do about bad language on the internet , 2013, NAACL.
[20] Martin Chodorow,et al. TOEFL11: A CORPUS OF NON‐NATIVE ENGLISH , 2013 .
[21] Eric P. Xing,et al. Diffusion of Lexical Change in Social Media , 2012, PloS one.
[22] David Bamman,et al. Unsupervised Discovery of Biographical Structure from Text , 2014, TACL.
[23] Sean F. Reardon,et al. 60 Years After Brown: Trends and Consequences of School Segregation , 2014 .
[24] David García,et al. It's a Man's Wikipedia? Assessing Gender Inequality in an Online Encyclopedia , 2015, ICWSM.
[25] Sanja Fidler,et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[26] Mounia Lalmas,et al. First Women, Second Sex: Gender Bias in Wikipedia , 2015, HT.
[27] Graeme Hirst,et al. GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus , 2015, CLfL@NAACL-HLT.
[28] J. Rickford,et al. Language and linguistics on trial: Hearing Rachel Jeantel (and other vernacular speakers) in the courtroom and beyond , 2016 .
[29] Steve Bien-Aimé. AP Stylebook normalizes sports as a male space , 2016 .
[30] Brendan T. O'Connor,et al. Demographic Dialectal Variation in Social Media: A Case Study of African-American English , 2016, EMNLP.
[31] Juan F. Restrepo,et al. Linguistic Discrimination in an English Language Teaching Program: Voices of the Invisible Others , 2016 .
[32] Kenneth M. Johnson,et al. Political Polarization along the Rural-Urban Continuum? The Geography of the Presidential Vote, 2000–2016 , 2017 .
[33] Sorin Adam Matei,et al. Structural Differentiation in Social Media , 2017, Lecture Notes in Social Networks.
[34] M. Williams,et al. Towards an Ethical Framework for Publishing Twitter Data in Social Research: Taking into Account Users’ Views, Online Context and Algorithmic Estimation , 2017, Sociology.
[35] J. Rosa,et al. Unsettling race and language: Toward a raciolinguistic perspective , 2017, Language in Society.
[36] Sarah Brayne. Big Data Surveillance: The Case of Policing , 2017, American sociological review.
[37] Preslav Nakov,et al. Predicting Factuality of Reporting and Bias of News Media Sources , 2018, EMNLP.
[38] Luke S. Zettlemoyer,et al. Deep Contextualized Word Representations , 2018, NAACL.
[39] J. Macswan,et al. Academic English as standard language ideology: A renewed research agenda for asset-based language education , 2018, Language Teaching Research.
[40] Alex S. Taylor,et al. Let's Talk About Race: Identity, Chatbots, and AI , 2018, CHI.
[41] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[42] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[43] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[44] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.
[45] R. Queen,et al. Language and Discrimination: Generating Meaning, Perceiving Identities, and Discriminating Outcomes , 2020 .
[46] Kris McGuffie,et al. The Radicalization Risks of GPT-3 and Advanced Neural Language Models , 2020, ArXiv.
[47] Omer Levy,et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.
[48] Yejin Choi,et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models , 2020, FINDINGS.
[49] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res.
[50] Timnit Gebru,et al. Lessons from archives: strategies for collecting sociocultural data in machine learning , 2019, FAT*.
[51] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[52] Myle Ott,et al. Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.
[53] Quoc V. Le,et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators , 2020, ICLR.
[54] Kevin Gimpel,et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.
[55] Vishrav Chaudhary,et al. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data , 2019, LREC.
[56] Solon Barocas,et al. Language (Technology) is Power: A Critical Survey of “Bias” in NLP , 2020, ACL.
[57] Casey Fiesler,et al. No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service , 2020, ICWSM.
[58] Timnit Gebru,et al. Datasheets for datasets , 2018, Commun. ACM.
[59] Jack Bandy,et al. Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus , 2021, NeurIPS Datasets and Benchmarks.
[60] Maarten Sap,et al. Documenting the English Colossal Clean Crawled Corpus , 2021, ArXiv.
[61] James Zou,et al. Persistent Anti-Muslim Bias in Large Language Models , 2021, AIES.
[62] Yejin Choi,et al. Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection , 2021, ArXiv.
[63] Po-Sen Huang,et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher , 2021, ArXiv.
[64] Emily M. Bender,et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 , 2021, FAccT.
[65] Leo Gao. An Empirical Exploration in Quality Filtering of Text Data , 2021, ArXiv.
[66] Barbara Plank,et al. MultiLexNorm: A Shared Task on Multilingual Lexical Normalization , 2021, Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021).
[67] David Bamman,et al. Characterizing English Variation across Social Media Communities with BERT , 2021, Transactions of the Association for Computational Linguistics.
[68] Richard Frank,et al. Upvoting extremism: Collective identity formation and the extreme right on Reddit , 2020, New Media Soc.
[69] Dong Nguyen,et al. On learning and representing social meaning in NLP: a sociolinguistic perspective , 2021, NAACL.
[70] Su Lin Blodgett. Sociolinguistically Driven Approaches for Just Natural Language Processing , 2021 .
[71] Noam Shazeer,et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , 2021, ArXiv.
[72] Charles Foster,et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling , 2020, ArXiv.
[73] Diyi Yang,et al. The Importance of Modeling Social Factors of Language: Theory and Practice , 2021, NAACL.
[74] Dmytro Okhonko,et al. HTLM: Hyper-Text Pre-Training and Prompting of Language Models , 2021, ICLR.
[75] Courtney Heldreth,et al. “I don’t Think These Devices are Very Culturally Sensitive.”—Impact of Automated Speech Recognition Errors on African Americans , 2021, Frontiers in Artificial Intelligence.
[76] K. Coussement,et al. What makes people share political content on social media? The role of emotion, authority and ideology , 2021, Comput. Hum. Behav.