Data pruning in consumer choice models

A common, if not ubiquitous, practice in Marketing when estimating models on scanner panel data is to (a) observe the data, (b) prune the data to a “manageable” number of brands or SKUs, and (c) fit models to the remaining data. We demonstrate that this pruning can lead to significantly different (and potentially biased) elasticities, and hence to different managerial outcomes, especially when the model is misspecified. First, we justify our claims theoretically by casting the general problem in a classic missing-data framework and showing that commonly used pruning mechanisms (gleaned from the current academic Marketing literature) can induce a nonignorable missing-data mechanism. Second, we summarize an extensive set of simulations run to identify the factors driving that bias. The results indicate much greater pruning bias when model fit is poor (small $$R^{2}$$), when random utility errors are correlated with the covariates, or when the model is misspecified (e.g., a homogeneous logit is specified when a mixed logit is true). Finally, we demonstrate our findings empirically on the widely cited and heavily used fabric softener data of Fader and Hardie (1996). A number of estimates vary with the way the data are pruned, including the magnitudes of market mix and attribute elasticities and purchase probabilities, but the pruning effect is smaller for better-fitting models.
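The misspecification case can be illustrated with a small simulation. The sketch below is not the authors' simulation design: the function names (`simulate_choices`, `top_brands`, `prune`, `fit_logit`), the single price covariate, the pruning rule (keep the top-selling brands and drop purchase occasions on which a pruned brand was bought), and all parameter values are assumptions chosen for illustration. It fits a homogeneous logit to full and pruned data, once when the homogeneous logit is true and once when price sensitivity is heterogeneous (a mixed logit).

```python
import numpy as np


def simulate_choices(n_obs, n_brands, beta_mean, beta_sd, rng):
    """Simulate one brand choice per occasion; beta_sd > 0 gives a mixed logit."""
    intercepts = np.linspace(1.0, -1.0, n_brands)  # assumed brand intercepts
    price = rng.uniform(1.0, 3.0, size=(n_obs, n_brands))
    beta = rng.normal(beta_mean, beta_sd, size=(n_obs, 1))  # per-occasion price coefficient
    utility = intercepts + beta * price + rng.gumbel(size=(n_obs, n_brands))
    return price, utility.argmax(axis=1)


def top_brands(choice, n_brands, k):
    """Return the sorted indices of the k brands with the most purchases."""
    shares = np.bincount(choice, minlength=n_brands)
    return np.sort(np.argsort(shares)[::-1][:k])


def prune(price, choice, keep):
    """Keep only brands in `keep` (sorted); drop occasions where another brand was bought."""
    mask = np.isin(choice, keep)
    new_choice = np.searchsorted(keep, choice[mask])  # re-index brands to 0..k-1
    return price[mask][:, keep], new_choice


def fit_logit(price, choice, n_iter=25):
    """Homogeneous MNL by Newton-Raphson; returns [brand intercepts (brand 0 = baseline), price coef]."""
    n, J = price.shape
    X = np.zeros((n, J, J))  # J-1 brand dummies plus one price column
    for j in range(1, J):
        X[:, j, j - 1] = 1.0
    X[:, :, J - 1] = price
    y = np.zeros((n, J))
    y[np.arange(n), choice] = 1.0
    theta = np.zeros(J)
    for _ in range(n_iter):
        v = X @ theta
        v -= v.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(v)
        p /= p.sum(axis=1, keepdims=True)
        grad = np.einsum("nj,njk->k", y - p, X)
        xbar = np.einsum("nj,njk->nk", p, X)
        info = np.einsum("nj,njk,njl->kl", p, X, X) - xbar.T @ xbar  # negative Hessian
        theta = theta + np.linalg.solve(info, grad)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for label, sd in [("homogeneous truth", 0.0), ("mixed-logit truth", 1.0)]:
        price, choice = simulate_choices(20000, 6, -2.0, sd, rng)
        keep = top_brands(choice, 6, 3)
        full = fit_logit(price, choice)[-1]
        pruned = fit_logit(*prune(price, choice, keep))[-1]
        print(f"{label}: price coef full={full:.3f}, pruned={pruned:.3f}")
```

When the homogeneous logit is true, the choice probabilities conditional on buying one of the retained brands are still of logit form, so this pruning rule leaves the price coefficient consistently estimated. Under heterogeneity that property no longer holds, and comparing the full and pruned estimates in the second run gives a rough sense of how much the pruning step alone can move the results.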

[1] Gary J. Russell, et al. A Probabilistic Choice Model for Market Segmentation and Elasticity Structure, 1989.

[2] Peter E. Rossi, et al. Purchase frequency, sample selection, and price sensitivity: The heavy-user bias, 1994.

[3] John D. C. Little, et al. A Logit Model of Brand Choice Calibrated on Scanner Data, 2011, Mark. Sci.

[4] Peter S. Fader, et al. Modeling Consumer Choice among SKUs, 1996.

[5] Peter E. Rossi, et al. Response Modeling with Nonrandom Marketing-Mix Variables, 2004.

[6] Robert E. Krider, et al. Competitive Dynamics and the Introduction of New Products: The Motion Picture Timing Game, 1998.

[7] D. Rubin, et al. Statistical Analysis with Missing Data, 1989.

[8] W. Dunlap, et al. The Accuracy of Different Methods for Estimating the Standard Error of Correlations Corrected for Range Restriction, 1997.

[9] K. Land, et al. Estimating the Effect of Nonignorable Nonresponse in Sample Surveys, 1993.

[10] J. Heckman. The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models, 1976.

[11] F. Nelson. Censored Regression Models with Unobserved Stochastic Censoring Thresholds, 1974.

[12] Jerry A. Hausman, et al. Social Experimentation, Truncated Distributions, and Efficient Estimation, 1977.

[13] Teck-Hua Ho, et al. A Parsimonious Model of Stockkeeping-Unit Choice, 2003.

[14] J. Mendoza, et al. A Bootstrap Confidence Interval Based on a Correlation Corrected for Range Restriction, 1991, Multivariate Behavioral Research.

[15] R. Sugden, et al. Ignorable and informative designs in survey sampling inference, 1984.

[16] J. Heckman. Sample selection bias as a specification error, 1979.

[17] D. Rubin. Inference and Missing Data, 1975.

[18] S. Siddarth, et al. Determining Segmentation in Sales Response across Consumer Purchase Behaviors, 1998.

[19] Carl F. Mela, et al. The Long-Term Impact of Promotions on Consumer Stockpiling Behavior, 1998.

[20] D. Rubin, et al. Handling “Don't Know” Survey Responses: The Case of the Slovenian Plebiscite, 1995.

[21] William S. Reece, et al. Imputation of Missing Values When the Probability of Response Depends on the Variable Being Imputed, 1982.

[22] D. McFadden, et al. Mixed MNL Models for Discrete Response, 2000.

[23] D. Bell, et al. Looking for Loss Aversion in Scanner Panel Data: The Confounding Effect of Price Response Heterogeneity, 2000.

[24] Charles B. Weinberg, et al. The Impact of Heterogeneity in Purchase Timing and Price Responsiveness on Estimates of Sticker Shock Effects, 1999.

[25] D. Holmes, et al. The robustness of the usual correction for restriction in range due to explicit selection, 1990.

[26] Malcolm James Ree, et al. Rangej: A Pascal Program to Compute the Multivariate Correction for Range Restriction, 1994.

[27] J. Heckman. Shadow prices, market wages, and labor supply, 1974.

[28] D. DeMets, et al. Estimation of a Simple Regression Coefficient in Samples Arising from a Sub-Sampling Procedure, 1977.

[29] Paul P. Foley, et al. Explanations for Accuracy of the General Multivariate Formulas in Correcting for Range Restriction, 1994.

[30] Pradeep K. Chintagunta, et al. Inertia and Variety Seeking in a Model of Brand-Purchase Timing, 1998.

[31] D. Pfeffermann. The Role of Sampling Weights when Modeling Survey Data, 1993.

[32] Kirthi Kalyanam, et al. Estimating Irregular Pricing Effects: A Stochastic Spline Regression Approach, 1998.

[33] Chris J. Skinner, et al. Allowing for non‐ignorable non‐response in the analysis of voting intention data, 1999.

[34] Carl F. Mela, et al. Managing Advertising and Promotion for Long-Run Profitability, 1999.

[35] Gary J. Russell, et al. A Relationship between Market Share Elasticities and Brand Switching Probabilities, 1998.

[36] Steven R. Lerman, et al. The Estimation of Choice Probabilities from Choice Based Samples, 1977.

[37] Peter E. Rossi, et al. A Bayesian Approach to Estimating Household Parameters, 1993.

[38] Kenneth P. Yusko, et al. Determining the Appropriate Correction when the Type of Range Restriction is Unknown: Developing a Sample-Based Procedure, 1991.