Word Counts and Topic Models

Digital journalism and social media produce huge amounts of content every day, and journalism scholars face new challenges in describing and analyzing this wealth of information. Borrowing tools and resources from computer science and computational linguistics, they have started to gain insights into the constant information flow and have made big data a regular feature of the scientific debate. Both deductive (manual and semi-automated) and inductive (fully automated) text analysis methods are part of this new toolset. To make the automated research process more tangible, we provide a roadmap of common (semi-)automated options for text analysis. We describe the assumptions and workflows of rule-based approaches, dictionaries, supervised machine learning, document clustering, and topic models. We show that automated methods have different strengths that open up different opportunities, enriching but not replacing the range of manual content analysis methods.
