Content Analysis of Textbooks via Natural Language Processing: Findings on Gender, Race, and Ethnicity in Texas U.S. History Textbooks

Cutting-edge data science techniques can shed new light on fundamental questions in educational research. We apply techniques from natural language processing (lexicons, word embeddings, topic models) to 15 U.S. history textbooks widely used in Texas between 2015 and 2017, studying their depiction of historically marginalized groups. We find that Latinx people are rarely discussed, and the most common famous figures are nearly all White men. Lexicon-based approaches show that Black people are described as performing actions associated with low agency and power. Word embeddings reveal that women tend to be discussed in the contexts of work and the home. Topic modeling highlights the higher prominence of political topics compared with social ones. We also find that more conservative counties tend to purchase textbooks with less representation of women and Black people. Building on a rich tradition of textbook analysis, we release our computational toolkit to support new research directions.
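As a rough illustration of the lexicon-based approach mentioned above, the following minimal sketch averages verb agency scores per group. It is not the authors' released toolkit: the lexicon entries, the group labels, the function name `mean_agency`, and the (group, verb) pairs are all hypothetical, and it assumes the pairs were already extracted upstream (for example, by a dependency parser).

```python
from collections import defaultdict

# Hypothetical toy lexicon: verb lemma -> agency score (+1 high, -1 low).
# These entries are illustrative assumptions, not values from any published lexicon.
TOY_AGENCY = {"lead": 1, "fight": 1, "build": 1, "suffer": -1, "wait": -1, "obey": -1}


def mean_agency(pairs, lexicon=TOY_AGENCY):
    """Average the lexicon score of verbs per group, skipping verbs absent from the lexicon."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, verb in pairs:
        score = lexicon.get(verb.lower())
        if score is None:
            continue  # verb not covered by the lexicon; ignore it
        totals[group] += score
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in counts}


# Hypothetical (group, verb lemma) pairs standing in for parser output.
example_pairs = [
    ("group_a", "lead"),
    ("group_a", "build"),
    ("group_b", "suffer"),
    ("group_b", "fight"),
    ("group_b", "sing"),  # not in the toy lexicon, so it is skipped
]
scores = mean_agency(example_pairs)
```

A lower average for one group would suggest that its members are more often the subjects of low-agency verbs, although any real analysis would depend on lexicon coverage and on the accuracy of the upstream parsing.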
