Learning from noisy label proportions for classifying online social data

Inferring latent attributes (e.g., demographics) of social media users is important for improving the accuracy and validity of social media analysis methods. While most existing approaches use either heuristics or supervised classification, recent work has shown that accurate classification models can be trained using supervision from population statistics. These learning with label proportions (LLP) models are fit on bags of instances and then applied to individual accounts. However, it is well known that the users of many social media sites, such as Twitter, are not a representative sample of the population; as a result, the label proportions contain many sources of noise (e.g., sampling bias), which can in turn degrade the quality of the resulting model. In this paper, we investigate classification algorithms that use population-level statistical constraints, such as demographics, names, and social network followers, to fit classifiers that predict individual user attributes. We propose LLP methods that explicitly model the noise inherent in these label proportions. On several real and synthetic datasets, we find that combining these enhancements can significantly reduce average classification error, by 7%, resulting in methods that are robust to noise in the provided label proportions.
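To make the bag-level setup concrete, the following is a minimal sketch of a plain LLP baseline, not the noise-aware methods proposed in the paper. It assumes a logistic-regression model trained by gradient descent so that each bag's mean predicted probability matches that bag's label proportion under a squared-error loss. The function names (`fit_llp`, `predict`, `make_bag`), the hyperparameters, and the synthetic two-feature data are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_llp(bags, proportions, lr=0.5, epochs=500, l2=1e-3):
    """Fit logistic-regression weights so that each bag's mean predicted
    probability matches its label proportion (squared-error bag loss).
    Hypothetical helper; no individual labels are used."""
    w = np.zeros(bags[0].shape[1])
    b = 0.0
    for _ in range(epochs):
        grad_w = l2 * w                      # L2 regularization on weights
        grad_b = 0.0
        for X, p in zip(bags, proportions):
            q = sigmoid(X @ w + b)           # per-instance probabilities
            err = q.mean() - p               # bag-level mismatch
            dq = q * (1.0 - q) / len(q)      # d(mean q) / d(logit) per instance
            grad_w += 2.0 * err * (dq @ X)
            grad_b += 2.0 * err * dq.sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Apply the bag-trained model to individual instances."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Synthetic data (an assumption for illustration): feature 0 separates
# the two classes, feature 1 is pure noise.
rng = np.random.default_rng(0)

def make_bag(n, p):
    y = (rng.random(n) < p).astype(int)
    X = rng.normal(size=(n, 2))
    X[:, 0] += np.where(y == 1, 1.5, -1.5)
    return X, y

# Nominal proportions; each sampled bag deviates slightly from them,
# which is a mild form of the label-proportion noise discussed above.
proportions = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
bags = [make_bag(200, p)[0] for p in proportions]

w, b = fit_llp(bags, proportions)

# Evaluate on individually labeled held-out instances.
X_test, y_test = make_bag(1000, 0.5)
accuracy = (predict(X_test, w, b) == y_test).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```

The individual labels in `make_bag` are used only for evaluation; training sees only the bags and their proportions, which mirrors how population statistics supervise the classifier.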
