Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case

Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less resourced languages, as it allows current natural language processing techniques to be applied to large amounts of unstructured data. In this work, we study the linguistic and social aspects of young and adult people’s behaviour based on the content of their tweets and the social relations that arise from them. With this objective in mind, we gathered over 10 million tweets from more than 8000 users. First, we classified each user by life stage (young/adult) according to the writing style of their tweets. Second, we applied topic modelling techniques to the personal tweets to find the most popular topics for each life stage. Third, we established the relations and communities that emerge from the retweets. We conclude that the large amounts of unstructured data provided by Twitter facilitate social research through computational techniques such as natural language processing, making it possible both to segment communities by demographic characteristics and to discover how those communities interact and relate to one another.
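The third step, relating retweet structure to life stage, can be illustrated with a minimal sketch. It is not the authors' implementation: the user handles, the life-stage labels, and the choice of modularity as the quality measure are assumptions made here for illustration. The sketch builds a weighted retweet graph from (retweeter, original author) pairs and measures how well a partition by life stage matches the retweet structure.

```python
from collections import defaultdict


def build_retweet_graph(retweets):
    """Aggregate (retweeter, original_author) pairs into undirected edge weights."""
    weights = defaultdict(int)
    for retweeter, author in retweets:
        if retweeter != author:  # ignore self-retweets
            weights[frozenset((retweeter, author))] += 1
    return dict(weights)


def modularity(weights, community_of):
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2.

    m is the total edge weight, L_c the edge weight inside community c,
    and d_c the sum of weighted degrees of the nodes in c.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    internal = defaultdict(float)
    degree = defaultdict(float)
    for edge, w in weights.items():
        u, v = tuple(edge)
        cu, cv = community_of[u], community_of[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += w
    return sum(internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


# Hypothetical toy data: three "young" users and three "adult" users.
retweets = [
    ("a", "b"), ("b", "c"), ("c", "a"),  # retweets among young users
    ("x", "y"), ("y", "z"), ("z", "x"),  # retweets among adult users
    ("c", "x"),                          # one retweet across life stages
]
life_stage = {"a": "young", "b": "young", "c": "young",
              "x": "adult", "y": "adult", "z": "adult"}

graph = build_retweet_graph(retweets)
q_by_stage = modularity(graph, life_stage)
q_single = modularity(graph, {user: "all" for user in life_stage})
print(f"modularity by life stage: {q_by_stage:.3f}")
print(f"modularity of one big community: {q_single:.3f}")
```

A partition that scores clearly higher than the single-community baseline suggests that users retweet mostly within their own group. On real data, a community detection algorithm would search for the partition instead of taking it from the life-stage labels.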
