Text as big data: Develop codes of practice for rigorous computational text analysis in energy social science

Abstract: Augmenting traditional social science methods with computational analysis is crucial if we are to exploit the vast digital archives of text data that have become available over the past two decades. In this journal, Benites-Lazaro et al. [1] showcase this by applying topic modeling and other computational methods to an actor-specific examination of changes in policy discourse on ethanol in Brazil, and they point out methodological promises and challenges. However, their contribution also highlights the need to establish codes of practice for computational text analysis. In this perspective, we discuss five areas for improvement when treating text as big data, in light of three guiding principles from computational research (transparency, reproducibility and validation), to facilitate rigorous research practice: (1) full transparency about data collection and corpus construction, (2) comprehensive method descriptions that enable reproducibility by other researchers, (3) application of rigorous model validation procedures, (4) interpretation of results based on primary text and a clear research design, and (5) critical discussion and contextualization of the main findings. We conclude that the energy social science community needs to develop codes of practice to build on the promising research within the field of computational text analysis, and we suggest first steps in this direction.

[1] B. Sovacool. What Are We Doing Here? Analyzing Fifteen Years of Energy Scholarship and Proposing a Social Science Research Agenda, 2014.

[2] M. Lahsen, et al. Business storytelling about energy and climate change: The case of Brazil’s ethanol industry, 2017.

[3] Patrik Svensson, et al. The Landscape of Digital Humanities, 2010, Digit. Humanit. Q.

[4] Chris Reed, et al. Argument Mining: A Survey, 2020, Computational Linguistics.

[5] A. Giarolla, et al. Topic modeling method for analyzing social actor discourses on climate change, energy and food security, 2018, Energy Research & Social Science.

[6] Justin Grimmer, et al. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts, 2013, Political Analysis.

[7] Roger D Peng, et al. Reproducible research and Biostatistics, 2009, Biostatistics.

[8] Biljana Macura, et al. The role of reporting standards in producing robust literature reviews, 2018, Nature Climate Change.

[9] A. Pentland, et al. Computational Social Science, 2009, Science.

[10] Jing Liao, et al. Did a change in Nature journals’ editorial policy for life sciences research improve reporting?, 2019, BMJ Open Science.

[11] K. Isoaho, et al. A critical review of discursive approaches in energy transitions, 2019, Energy Policy.

[12] R. Peng. Reproducible Research in Computational Science, 2011, Science.

[13] Christopher M. Danforth, et al. Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter, 2011, PLoS ONE.

[14] Ruslan Salakhutdinov, et al. Evaluation methods for topic models, 2009, ICML '09.

[15] Christopher M. Danforth, et al. Climate Change Sentiment on Twitter: An Unsolicited Public Opinion Poll, 2015, PLoS ONE.

[16] Ralf Krestel, et al. Domain-specific word embeddings for patent classification, 2019, Data Technol. Appl.

[17] Benjamin Hofner, et al. Reproducible research in statistics: A review and guidelines for the Biometrical Journal, 2016, Biometrical journal. Biometrische Zeitschrift.

[18] R. Kitchin, et al. Big Data, new epistemologies and paradigm shifts, 2014, Big Data Soc.

[19] Michael Röder, et al. Exploring the Space of Topic Coherence Measures, 2015, WSDM.

[20] Arho Toikka, et al. A Big Data View of the European Energy Union: Shifting from ‘a Floating Signifier’ to an Active Driver of Decarbonisation?, 2019, Politics and Governance.

[21] Daria Gritsenko, et al. Vodka on ice? Unveiling Russian media perceptions of the Arctic, 2016.

[22] Nick Obradovich, et al. Rapidly declining remarkability of temperature anomalies may obscure public perception of climate change, 2019, Proceedings of the National Academy of Sciences.

[23] C. Madu, et al. Modeling landscape sustainability in the oil producing Niger delta area of Nigeria, 2019, Energy Policy.

[24] Claudio Cioffi-Revilla, et al. Computational social science, 2010.

[25] Giovanni Baiocchi, et al. Reproducible research in computational economics: guidelines, integrated approaches, and open source software, 2007.

[26] L. L. Benites-Lazaro, et al. CSR as a legitimatizing tool in carbon market: Evidence from Latin America’s Clean Development Mechanism, 2017.

[27] Thomas Jacobs, et al. Topic models meet discourse analysis: a quantitative tool for a qualitative approach, 2019, International Journal of Social Research Methodology.

[28] David M. W. Powers, et al. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, 2011, ArXiv.

[29] Rada Mihalcea, et al. Text Mining: A Guidebook for the Social Sciences, 2016.

[30] Claudio Cioffi-Revilla. Computational Social Science, 2010.

[31] Yolanda Gil, et al. Enhancing reproducibility for computational methods, 2016, Science.

[32] J. Fowler, et al. Rapid assessment of disaster damage using social media activity, 2016, Science Advances.

[33] Derek Greene, et al. An analysis of the coherence of descriptors in topic modeling, 2015, Expert Syst. Appl.

[34] Chong Wang, et al. Reading Tea Leaves: How Humans Interpret Topic Models, 2009, NIPS.

[35] Matt Taddy, et al. Text As Data, 2017, Journal of Economic Literature.

[36] Eric M Prager, et al. Improving transparency and scientific rigor in academic publishing, 2018, Brain and Behavior.

[37] M. Hajer, et al. A decade of discourse analysis of environmental politics: Achievements, challenges, perspectives, 2005.

[38] Fabrizio Sebastiani, et al. Machine learning in automated text categorization, 2001, CSUR.

[39] Benjamin K. Sovacool, et al. Promoting novelty, rigor, and style in energy social science: Towards codes of practice for appropriate methods and research design, 2018, Energy Research & Social Science.

[40] William F. Lamb, et al. Fast growing research on negative emissions, 2017.

[41] Silke Adam, et al. Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology, 2018.

[42] Eetu Mäkelä, et al. Topic Modeling and Text Analysis for Qualitative Policy Research, 2019, Policy Studies Journal.

[43] L. Sanderink. Shattered frames in global energy governance: Exploring fragmented interpretations among renewable energy institutions, 2020.

[44] Nello Cristianini, et al. Content analysis of 150 years of British periodicals, 2017, Proceedings of the National Academy of Sciences.

[45] Petter Törnberg, et al. Muslims in social media discourse: Combining topic modeling and critical discourse analysis, 2016.

[46] John Unsworth, et al. A Companion to Digital Humanities, 2008.

[47] A. Saltelli, et al. Ethics of quantification: illumination, obfuscation and performative legitimation, 2020, Palgrave Communications.

[48] Michael Gleicher, et al. Task-Driven Comparison of Topic Models, 2016, IEEE Transactions on Visualization and Computer Graphics.

[49] Ilkka Tuomi. Data is more than knowledge: implications of the reversed knowledge hierarchy for knowledge management and organizational memory, 1999.

[50] R. Hirschheim. Information Systems Epistemology: An Historical Perspective, 2000.

[51] G. Miller. Sociology. Social scientists wade into the tweet stream, 2011, Science.

[52] H. Klüver, et al. Measuring Interest Group Influence Using Quantitative Text Analysis, 2009.

[53] Christopher Gandrud, et al. Reproducible Research with R and RStudio, 2013.

[54] Loren Collingwood, et al. Tradeoffs in Accuracy and Efficiency in Supervised Learning Methods, 2012.

[55] Sebastian Benthall, et al. Philosophy of Computational Social Science, 2016.

[56] E. Grubert, et al. Villainous or valiant? Depictions of oil and coal in American fiction and nonfiction narratives, 2017.

[57] Nader Shaikh, et al. A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review, 2017, PLoS ONE.

[58] Matthew L. Jockers, et al. Text-Mining the Humanities, 2015.

[59] John P. A. Ioannidis, et al. What does research reproducibility mean?, 2016, Science Translational Medicine.

[60] Andrew McCallum, et al. Optimizing Semantic Coherence in Topic Models, 2011, EMNLP.

[61] Ronald N. Giere, et al. ESP and Psychokinesis: A Philosophical Examination, 1980.

[62] Margaret E. Roberts, et al. A Model of Text for Experimentation in the Social Sciences, 2016.

[63] David Mimno, et al. Evaluating the Stability of Embedding-based Word Similarities, 2018, TACL.

[64] Manuel W. Bickel. Reflecting trends in the academic landscape of sustainable energy using probabilistic topic modeling, 2019.

[65] Scott A. Golder, et al. Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures, 2011.

[66] Abraham S. D. Tidwell, et al. Energy ideals, visions, narratives, and rhetoric: Examining sociotechnical imaginaries theory and methodology in energy research, 2018.

[67] Anisa Rowhani-Farid, et al. Badges for sharing data and code at Biostatistics: an observational study, 2018, F1000Research.

[68] Hans Ekkehard Plesser, et al. Reproducibility vs. Replicability: A Brief History of a Confused Terminology, 2018, Front. Neuroinform.