Topic Modelling Methodology: Its Use in Information Systems and Other Managerial Disciplines

Over the last decade, quantitative text mining approaches to content analysis have gained increasing traction within information systems research and related fields such as business administration. Recently, topic models, which are intended to give their users an overview of the themes discussed in a collection of documents, have gained popularity. However, while convenient tools for creating this class of models exist, evaluating topic models poses significant challenges to their users. In this research, we investigate how questions of model validity and the trustworthiness of presented analyses are addressed across disciplines. We do so through a structured review of methodological approaches in the journals of the Financial Times 50 ranking. We identify 59 methodological research papers, 24 implementations of topic models, 33 research papers using topic models in Information Systems (IS) research, and 29 papers using such models in other managerial disciplines. The results indicate a need for model implementations usable by a wider audience, for more implementations of model validation techniques, and for a discussion of the theoretical foundations of topic-modelling-based research.
