Feature Selection as Causal Inference: Experiments with Text Classification

This paper proposes a matching technique for learning causal associations between word features and class labels in document classification. The goal is to identify features that are more meaningful and generalizable than those found by purely correlational approaches. Experiments with sentiment classification show that the proposed method identifies interpretable word associations with sentiment and improves classification performance in a majority of cases. The method is particularly effective when applied to out-of-domain data.
