Learning to Recognize Dialect Features

Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in “He ∅ running”. In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We therefore train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.
