Paper Information - Adjusting for Confounding with Text Matching

Adjusting for Confounding with Text Matching

We identify situations in which conditioning on text can address confounding in observational studies. We argue that a matching approach is particularly well-suited to this task, but existing matching methods are ill-equipped to handle high-dimensional text data. Our proposed solution is to estimate a low-dimensional summary of the text and condition on this summary via matching. We propose a method of text matching, topical inverse regression matching, that allows the analyst to match both on the topical content of confounding documents and the probability that each of these documents is treated. We validate our approach and illustrate the importance of conditioning on text to address confounding with two applications: the effect of perceptions of author gender on citation counts in the international relations literature and the effects of censorship on Chinese social media users.

Verification Materials: The materials required to verify the computational reproducibility of the results, procedures, and analyses in this article are available on the American Journal of Political Science Dataverse within the Harvard Dataverse Network, at: https://doi.org/10.7910/DVN/HTMX3K.

Social media users in China are censored every day, but it is largely unknown how the experience of being censored affects their future online experience. Are social media users who are censored for the first time flagged by censors for increased scrutiny in the future? Is censorship “targeted” and “customized” toward specific users? Do social media users avoid writing after being censored? Do they continue to write on sensitive topics or do they avoid them? Experimentally manipulating censorship would allow us to make credible causal inferences about the effects of experiencing censorship, but this is impractical and unethical outside of a lab setting.

Inferring causal effects in observational settings is challenging due to confounding. The types of users who are censored might have different opinions that drive them to write differently than the types of users who are not censored. This in turn might affect both the users’ rate of censorship as well as future behavior and outcomes. We argue that conditioning on the text of censored social media posts and other user-level characteristics can substantially decrease or eliminate confounding and allow credible causal inferences with observational data. Intuitively, if we can find nearly identical posts, one of which is censored while the

Margaret E. Roberts is Associate Professor, Department of Political Science, University of California, San Diego, Social Sciences Building 301, 9500 Gilman Drive, #0521, La Jolla, CA 92093-0521 (meroberts@ucsd.edu). Brandon M. Stewart is Assistant Professor and Arthur H. Scribner Bicentennial Preceptor, Department of Sociology, Princeton University, 149 Wallace Hall, Princeton, NJ 08544 (bms4@princeton.edu). Richard A. Nielsen is Associate Professor, Department of Political Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, E53 Room 455, Cambridge, MA 02139 (rnielsen@mit.edu).

We thank the following for helpful comments and suggestions on this work: David Blei, Naoki Egami, Chris Felton, James Fowler, Justin Grimmer, Erin Hartman, Chad Hazlett, Seth Hill, Kosuke Imai, Rebecca Johnson, Gary King, Adeline Lo, Will Lowe, Chris Lucas, Walter Mebane, David Mimno, Jennifer Pan, Marc Ratkovic, Matt Salganik, Caroline Tolbert, and Simone Zhang; audiences at the Princeton Text Analysis Workshop, Princeton Politics Methods Workshop, the University of Rochester, Microsoft Research, the Text as Data Conference, the Political Methodology Society, and the Visions in Methodology conference; and some tremendously helpful anonymous reviewers. We especially thank Dustin Tingley for numerous insightful conversations on the connections between STM and causal inference and Ian Lundberg for extended discussions on some technical details. Dan Maliniak, Ryan Powers, and Barbara Walter graciously supplied data and replication code for the gender and citations study. The JSTOR Data for Research program provided academic journal data for the international relations application.

This research was supported, in part, by the Eunice Kennedy Shriver National Institute of Child Health and Human Development under grant P2-CHD047879 to the Office of Population Research at Princeton University. The research was also supported by grants from the National Science Foundation RIDIR program, award numbers 1738411 and 1738288. This publication was made possible, in part, by a grant from the Carnegie Corporation of New York, supporting Richard Nielsen as an Andrew Carnegie Fellow. The statements made and views expressed are solely the responsibility of the authors.

American Journal of Political Science, Vol. 64, No. 4, October 2020, Pp. 887–903. ©2020, Midwest Political Science Association. DOI: 10.1111/ajps.12526
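
The abstract describes matching on a low-dimensional summary of each document: its topical content together with its probability of treatment. As a rough illustration only, the sketch below groups documents into strata by coarsening both quantities and keeps the strata that contain at least one treated and one control document. It is not the authors' implementation of topical inverse regression matching. Coarsened exact matching is used here as one simple matching strategy, and the topic proportions, treatment scores, field names, and bin count are hypothetical inputs rather than estimates from any fitted model.

```python
from collections import defaultdict


def coarsen(value, n_bins):
    """Map a value in [0, 1] to an integer bin index in 0..n_bins-1."""
    return min(int(value * n_bins), n_bins - 1)


def match_on_text_summary(docs, n_bins=4):
    """Stratify documents by coarsened topic proportions and treatment score.

    Returns only the strata containing both treated and control documents.
    """
    strata = defaultdict(list)
    for doc in docs:
        # Signature = binned topic proportions plus the binned treatment score.
        signature = tuple(coarsen(p, n_bins) for p in doc["topics"])
        signature += (coarsen(doc["score"], n_bins),)
        strata[signature].append(doc)

    matched = {}
    for signature, members in strata.items():
        treated = [d["id"] for d in members if d["treated"]]
        control = [d["id"] for d in members if not d["treated"]]
        if treated and control:  # drop strata lacking a comparison group
            matched[signature] = {"treated": treated, "control": control}
    return matched


# Hypothetical toy data: two topic proportions and a treatment score per document.
docs = [
    {"id": "a", "topics": (0.80, 0.20), "score": 0.70, "treated": True},
    {"id": "b", "topics": (0.85, 0.15), "score": 0.65, "treated": False},
    {"id": "c", "topics": (0.10, 0.90), "score": 0.20, "treated": False},
    {"id": "d", "topics": (0.30, 0.70), "score": 0.90, "treated": True},
]
matched = match_on_text_summary(docs)
print(matched)
```

In this toy example only documents `a` and `b` share a stratum, so `c` and `d` are dropped as unmatched. Finer bins give closer matches at the cost of discarding more documents.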

Margaret E. Roberts | Brandon M. Stewart | Richard A. Nielsen