Using Machine Learning to Enhance Archival Processing of Social Media Archives

This article reports on a study that uses machine learning to identify the incidence and shifting dynamics of hate speech in social media archives. To better meet the archival processing needs of such large-scale, fast-evolving archives, we propose the Data-driven and Circulating Archival Processing (DCAP) method. As a proof of concept, our study focuses on an English-language Twitter archive relating to COVID-19: tweets were repeatedly scraped between February and June 2020, ingested and aggregated within the COVID-19 Hate Speech Twitter Archive (CHSTA), and analyzed for hate speech using the Generative Adversarial Network–inspired DCAP method. The outcomes suggest that machine learning and data analytics can be used to surface and substantiate trends from CHSTA and similar social media archives, and that these trends could provide immediately useful knowledge for crisis response, in controversial situations, or for public policy development, as well as for subsequent historical analysis. The approach shows potential for integrating multiple aspects of the archival workflow and for supporting automatic, iterative redescription and reappraisal activities in ways that make them more accountable and more rapidly responsive to changing societal interests and unfolding developments.
