Using Machine Learning to Support Qualitative Coding in Social Science

Machine learning (ML) has become increasingly influential in human society, yet its primary advancements and applications are driven by research in only a few computational disciplines. Even applications that affect or analyze human behaviors and social structures are often developed with limited input from experts outside computational fields. Social scientists, who are trained to examine and explain the complexity of human behavior and interactions in the world, have considerable expertise to contribute to the development of ML applications for human-generated data, and their analytic practices could benefit from more human-centered ML methods. Although a few researchers have highlighted some gaps between ML and the social sciences [51, 57, 70], most discussions focus only on quantitative methods. Yet many social science disciplines rely heavily on qualitative methods to distill patterns that are challenging to discover through quantitative data. One common analysis method for qualitative data is qualitative coding. In this article, we highlight three challenges of applying ML to qualitative coding. Drawing on our experience of designing a visual analytics tool for collaborative qualitative coding, we also demonstrate the potential of using ML to support qualitative coding by shifting the focus to identifying ambiguity. We illustrate dimensions of ambiguity and discuss the relationship between disagreement and ambiguity. Finally, we propose three research directions to ground ML applications for social science as part of the progression toward human-centered machine learning.

[1]  Anselm L. Strauss,et al.  Qualitative Analysis For Social Scientists , 1987 .

[2]  Xiaoru Wang,et al.  SVMV - A Novel Algorithm for the Visualization of SVM Classification Results , 2006, ISNN.

[3]  Taylor Jackson Scott,et al.  Statistical affect detection in collaborative chat , 2013, CSCW.

[4]  Carlos Guestrin,et al.  Model-Agnostic Interpretability of Machine Learning , 2016, ArXiv.

[5]  Michael Brooks,et al.  Human Centered Tools for Analyzing Online Social Data , 2015 .

[6]  A. Pentland,et al.  Life in the network: The coming age of computational social science , 2009, Science.

[7]  Patrick J. Tierney,et al.  A qualitative analysis framework using natural language processing and graph theory , 2012 .

[8]  Cecilia R. Aragon,et al.  Aeonium: Visual analytics to support collaborative qualitative coding , 2017, 2017 IEEE Pacific Visualization Symposium (PacificVis).

[9]  Athanasios V. Vasilakos,et al.  Big data: From beginning to future , 2016, Int. J. Inf. Manag..

[10]  Martin Wattenberg,et al.  Visualizing Dataflow Graphs of Deep Learning Models in TensorFlow , 2018, IEEE Transactions on Visualization and Computer Graphics.

[11]  Giles Hooker,et al.  Discovering additive structure in black box functions , 2004, KDD.

[12]  K. Charmaz,et al.  Constructing Grounded Theory , 2014 .

[13]  Noah A. Smith,et al.  A Dependency Parser for Tweets , 2014, EMNLP.

[14]  Helmut Krcmar,et al.  Big Data , 2014, Wirtschaftsinf..

[15]  Shenyang Guo,et al.  Structural Equation Modeling , 2011 .

[16]  N. Denzin,et al.  The SAGE handbook of qualitative research , 2005 .

[17]  Kevin Crowston,et al.  Optimizing Features in Active Machine Learning for Complex Qualitative Content Analysis , 2014, LTCSS@ACL.

[18]  H. V. Jagadish Moving past the "Wild West" era for Big Data , 2015, BigData.

[19]  Johnny Saldaña,et al.  The Coding Manual for Qualitative Researchers , 2009 .

[20]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[21]  Paulo J. G. Lisboa,et al.  Making machine learning models interpretable , 2012, ESANN.

[22]  M. Sheelagh T. Carpendale,et al.  Analyzing Qualitative Data , 2017, ISS.

[23]  Simon Parsons,et al.  Principles of Data Mining by David J. Hand, Heikki Mannila and Padhraic Smyth, MIT Press, 546 pp., £34.50, ISBN 0-262-08290-X , 2004, The Knowledge Engineering Review.

[24]  Solon Barocas,et al.  Data Mining and the Discourse on Discrimination , 2014 .

[25]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[26]  Bill Doult,et al.  On with the new. , 1996, Nursing standard (Royal College of Nursing (Great Britain) : 1987).

[27]  Adam Tauman Kalai,et al.  Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , 2016, NIPS.

[28]  Trevor Darrell,et al.  Generating Visual Explanations , 2016, ECCV.

[29]  W. Neuman,et al.  Social Research Methods: Qualitative and Quantitative Approaches , 2002 .

[30]  Justin Grimmer,et al.  Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts , 2013, Political Analysis.

[31]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[32]  G. Imbens,et al.  Machine Learning Methods for Estimating Heterogeneous Causal Effects , 2015 .

[33]  Heikki Mannila,et al.  Principles of Data Mining , 2001, Undergraduate Topics in Computer Science.

[34]  D. Watts The “New” Science of Networks , 2004 .

[35]  Paulo J. G. Lisboa,et al.  Research directions in interpretable machine learning models , 2013, ESANN.

[36]  Cyrus Samii,et al.  Retrospective Causal Inference with Machine Learning Ensembles: An Application to Anti-recidivism Policies in Colombia , 2016, Political Analysis.

[37]  Daniel A. Keim,et al.  Human-centered machine learning through interactive visualization: review and open challenges , 2016, ESANN.

[38]  Hanna Wallach Computational Social Science: Toward a Collaborative Future , 2016 .

[39]  Been Kim,et al.  Towards A Rigorous Science of Interpretable Machine Learning , 2017, arXiv:1702.08608.

[40]  อนิรุธ สืบสิงห์,et al.  Data Mining: Practical Machine Learning Tools and Techniques , 2014 .

[41]  Cde,et al.  Theoretical Coding in Grounded Theory Methodology , 2009 .

[42]  Judith A. Holton,et al.  Remodeling Grounded Theory , 2004 .

[43]  Judith A. Holton,et al.  The Coding Process and Its Challenges , 2010 .

[44]  Kate Starbird,et al.  Social Media, Public Participation, and the 2010 BP Deepwater Horizon Oil Spill , 2015 .

[45]  Marjorie Darrah Neural Network Visualization Techniques , 2006 .

[46]  Kevin Crowston,et al.  A capability maturity model for scientific data management , 2010, ASIST.

[47]  Bart Peeters,et al.  Introduction to the KWALON Experiment: Discussions on Qualitative Data Analysis Software by Developers and Users , 2011 .

[48]  Melanie Birks,et al.  The Methodological Dynamism of Grounded Theory , 2015 .

[49]  H. Russell Bernard,et al.  Social Research Methods: Qualitative and Quantitative Approaches , 2000 .

[50]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[51]  Leysia Palen,et al.  Microblogging during two natural hazards events: what twitter may contribute to situational awareness , 2010, CHI.

[52]  Klaus-Robert Müller,et al.  Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models , 2017, ArXiv.

[53]  Bill Fitzgerald,et al.  Algorithms and Bias: Q. and A. With Cynthia Dwork , 2015 .

[54]  Justin Grimmer,et al.  We Are All Social Scientists Now: How Big Data, Machine Learning, and Causal Inference Work Together , 2014, PS: Political Science & Politics.

[55]  Kevin Crowston,et al.  Machine learning and rule-based automated coding of qualitative data , 2010, ASIST.

[56]  Finale Doshi-Velez,et al.  Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction , 2015, NIPS.

[57]  K. Seers Qualitative data analysis , 2011, Evidence Based Nursing.

[58]  Hanna Wallach Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency , 2019 .

[59]  Maya Cakmak,et al.  Power to the People: The Role of Humans in Interactive Machine Learning , 2014, AI Mag..

[60]  Shion Guha,et al.  Machine Learning and Grounded Theory Method: Convergence, Divergence, and Combination , 2016, GROUP.

[61]  Alfred Hermida,et al.  Content Analysis in an Era of Big Data: A Hybrid Approach to Computational and Manual Methods , 2013 .

[62]  Matthew B. Miles,et al.  Qualitative Data Analysis , 2009, Approaches and Processes of Social Science Research.

[63]  Sean A. Munson,et al.  Unequal Representation and Gender Stereotypes in Image Search Results for Occupations , 2015, CHI.

[64]  Kevin Crowston,et al.  Using natural language processing technology for qualitative data analysis , 2012, International Journal of Social Research Methodology.

[65]  Scott Lundberg,et al.  A Unified Approach to Interpreting Model Predictions , 2017, NIPS.

[66]  Hal R. Varian,et al.  Big Data: New Tricks for Econometrics , 2014 .

[67]  Tony Doyle,et al.  Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy , 2017, Inf. Soc..

[68]  Gregor Wiedemann,et al.  Opening up to Big Data: Computer-Assisted Analysis of Textual Data in Social Sciences , 2013 .

[69]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[70]  Gregor Wiedemann Text Mining for Qualitative Data Analysis in the Social Sciences , 2016 .

[71]  Gabriela Beirão,et al.  Understanding attitudes towards public transport and private car: A qualitative study , 2007 .

[72]  Theodore M. Porter From Quetelet to Maxwell: Social Statistics and the Origins of Statistical Physics , 1994 .

[73]  D. Boyd,et al.  Critical Questions for Big Data , 2012 .

[74]  R. Tesch Qualitative Research: Analysis Types and Software , 1990 .

[75]  Jeffrey Heer,et al.  Enterprise Data Analysis and Visualization: An Interview Study , 2012, IEEE Transactions on Visualization and Computer Graphics.

[76]  Martin Wattenberg,et al.  Embedding Projector: Interactive Visualization and Interpretation of Embeddings , 2016, ArXiv.

[77]  Taha Yasseri,et al.  A Biased Review of Biases in Twitter Studies on Political Collective Action , 2016, Front. Phys..

[78]  Latanya Sweeney,et al.  Discrimination in online ad delivery , 2013, CACM.