The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data

This article presents the Nordic Tweet Stream (NTS), a cross-disciplinary corpus project carried out by computer scientists and a group of sociolinguists interested in language variability and in the global spread of English. Our research integrates two types of empirical data: we rely not only on traditional structured corpus data but also on unstructured data sources that are often big and rich in metadata, such as Twitter streams. The NTS downloads tweets and associated metadata from Denmark, Finland, Iceland, Norway and Sweden. We first introduce some technical aspects of creating a dynamic real-time monitor corpus. A case study then illustrates how the corpus could be used as empirical evidence in sociolinguistic studies of the global spread of English to multilingual settings. The results show that English is the most frequently used language, accounting for almost a third of the material. These results can be used to assess how widespread English use is in the Nordic region, and they offer a big data perspective that complements previous small-scale studies. Future objectives include annotating the material, making it available to the scholarly community, and expanding the geographic scope of the data stream beyond the Nordic region.
