Fluency detection on communication networks

Social media corpora often come with structural information about how messages flow between people or organizations. This information is particularly useful when the linguistic evidence is sparse, incomplete, or of dubious quality. In this paper we construct a simple model that leverages the structure of Twitter data to help determine the set of languages in which each user is fluent. Our results demonstrate that imposing several intuitive constraints improves both performance and stability. We release the first annotated data set for exploring this task and discuss how our approach may be extended to other applications.
