Weakly Supervised User Profile Extraction from Twitter

While user attribute extraction on social media has received considerable attention, existing approaches, mostly supervised, encounter great difficulty in obtaining gold standard data and are therefore limited to predicting unary predicates (e.g., gender). In this paper, we present a weakly supervised approach to user profile extraction from Twitter. Users’ profiles from social media websites such as Facebook or Google Plus are used as a distant source of supervision for extracting their attributes from user-generated text. In addition to the traditional linguistic features used in distant supervision for information extraction, our approach also takes into account network information, a unique opportunity offered by social media. We test our algorithm on three attribute domains: spouse, education and job. Experimental results demonstrate that our approach is able to make accurate predictions of users’ attributes based on their tweets.
