Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analyzing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping outcomes interpretable. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarizing the posterior distribution of an LDA model is challenging, and individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures, which may not adequately capture qualitative aspects such as the interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarize the entire posterior distribution and identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we establish a more holistic definition of model evaluation, which assesses topic models based not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that selecting recurrent topics through our clustering methodology not only improves model likelihood but also improves the qualitative aspects of LDA, such as interpretability and stability. We illustrate our methods on an example from a large UK supermarket chain.
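To make the idea of post-processing posterior draws concrete, the following is a minimal sketch and not the clustering algorithm of the paper. It assumes each posterior draw is available as a list of topic-word probability vectors, pools the topics from all draws, groups them greedily by cosine similarity, and keeps only clusters that recur in a minimum share of draws. The function names, the similarity threshold of 0.8 and the recurrence threshold of 0.5 are illustrative assumptions; the toy data are made up.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def cluster_topics(draws, sim_threshold=0.8, min_recurrence=0.5):
    """Greedily group topics pooled from several posterior draws.

    draws: list of draws; each draw is a list of topic-word distributions.
    Returns (centroid, recurrence) pairs for clusters that appear in at
    least `min_recurrence` of the draws. Thresholds are assumed values.
    """
    clusters = []
    for d, topics in enumerate(draws):
        for topic in topics:
            # Find the most similar existing cluster above the threshold.
            best, best_sim = None, sim_threshold
            for c in clusters:
                s = cosine(topic, c["centroid"])
                if s >= best_sim:
                    best, best_sim = c, s
            if best is None:
                clusters.append(
                    {"members": [topic], "draws": {d}, "centroid": list(topic)}
                )
            else:
                best["members"].append(topic)
                best["draws"].add(d)
                n = len(best["members"])
                # Centroid is the element-wise mean of the member topics.
                best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
    summary = []
    for c in clusters:
        # Share of draws in which the cluster appears: a simple
        # stand-in for a measure of topic uncertainty.
        recurrence = len(c["draws"]) / len(draws)
        if recurrence >= min_recurrence:
            summary.append((c["centroid"], recurrence))
    return summary


def normalise(v):
    total = sum(v)
    return [x / total for x in v]


# Toy data: two stable topics over a 4-product vocabulary, slightly
# perturbed across 4 draws, with topic order swapped in odd draws to
# mimic label switching, plus a one-off uniform topic in the first draw.
base_a = [0.7, 0.2, 0.05, 0.05]
base_b = [0.05, 0.05, 0.2, 0.7]
draws = []
for i in range(4):
    eps = 0.01 * i
    a = normalise([base_a[0] - eps, base_a[1] + eps, base_a[2], base_a[3]])
    b = normalise([base_b[0], base_b[1], base_b[2] + eps, base_b[3] - eps])
    draws.append([a, b] if i % 2 == 0 else [b, a])
draws[0].append([0.25, 0.25, 0.25, 0.25])

summary = cluster_topics(draws)
for centroid, recurrence in summary:
    print([round(x, 3) for x in centroid], recurrence)
```

Because topics are matched by similarity rather than by index, the swapped topic order in the odd draws does not matter, which is the sense in which such a summary sidesteps label switching. The one-off uniform topic forms its own cluster and is dropped for appearing in only one of the four draws.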
