Writing Across the World's Languages: Deep Internationalization for Gboard, the Google Keyboard

This technical report describes our deep internationalization program for Gboard, the Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing systems, and we explain how and why we have been adding support for hundreds of language varieties from around the globe. Many of the world's languages are increasingly used in everyday writing, and we describe the trends we see. We cover the technological and logistical challenges of scaling a language technology product like Gboard to hundreds of language varieties, and describe the systems and processes we built to operate at that scale. Finally, we summarize the key takeaways from user studies we ran with speakers of hundreds of languages from around the world.
