Measuring Linguistic Diversity During COVID-19

Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. This paper calibrates such measures using the restrictions on international travel that resulted from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-referenced social media and web data. The goal of that work, however, has been to describe the corpora themselves rather than to make inferences about the underlying populations. This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias that non-local populations introduce into digital corpora. These methods show where significant changes have taken place and whether those changes increase or decrease diversity. This is an important step in aligning digital corpora like social media with the real-world populations that produced them.
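As a rough illustration of the approach named above, the sketch below computes the Herfindahl-Hirschman Index (the sum of squared language shares, so higher values mean more concentration and less diversity) and a generic difference-in-differences contrast. The function names, the choice of comparison periods, and the sample counts are assumptions made for illustration; the abstract does not specify how the paper sets up its comparison.

```python
def hhi(language_counts):
    """Herfindahl-Hirschman Index: the sum of squared language shares.

    Ranges from 1/n (n equally common languages) to 1.0 (a single language).
    """
    total = sum(language_counts.values())
    if total <= 0:
        raise ValueError("language_counts must contain at least one observation")
    return sum((count / total) ** 2 for count in language_counts.values())


def diff_in_diff(target_before, target_after, baseline_before, baseline_after):
    """Change in HHI over the period of interest, minus the change over a baseline.

    Assumption: the baseline pair captures ordinary variation unrelated to
    travel restrictions. A positive result means concentration rose (diversity
    fell) beyond what the baseline change would predict.
    """
    target_change = hhi(target_after) - hhi(target_before)
    baseline_change = hhi(baseline_after) - hhi(baseline_before)
    return target_change - baseline_change


# Hypothetical counts of geo-referenced posts per language code for one location.
before_restrictions = {"en": 50, "zh": 50}
during_restrictions = {"en": 100}
baseline_start = {"en": 80, "mi": 20}
baseline_end = {"en": 80, "mi": 20}

effect = diff_in_diff(
    before_restrictions, during_restrictions, baseline_start, baseline_end
)
print(f"difference-in-differences on HHI: {effect:.2f}")
```

In this toy case the HHI rises from 0.5 to 1.0 while the baseline is unchanged, which would be read as a drop in diversity once a non-local language disappears from the data.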
