This paper shows how ridge regression and other shrinkage estimates can be used to improve the performance of direct marketing scoring models. It reviews the key property of shrinkage estimates: they produce more stable estimates having smaller mean squared error than ordinary least squares models. It relates the idea of the effective number of parameters, or degrees of freedom, from the smoothing literature to ridge regression. The ridge estimates are shown to fit fewer degrees of freedom than the ostensible number of parameters in the model. This means that direct marketers can include more variables in a scoring model without danger of overfitting the data. Reducing the degrees of freedom by shrinking the estimates is shown to be more stable than dropping variables from the model with, e.g., stepwise regression. These results are corroborated by comparing shrinkage estimates with stepwise regression on two data sets. Improved ways of drawing samples from a database for estimating and validating models are proposed.

Keywords: scoring models, shrinkage estimation, ridge regression, principal components regression, weight decay

1 Introduction

One of the fundamental questions of direct marketing is deciding who should receive a particular offer. Whenever marketers have the ability to target an offer or communication to an individual, rather than a large group of people such as the mass market, they must judiciously decide who should receive the offer. If a company wants to have a good relationship with its customers, it should not make offers that are not relevant; doing so undermines the company's efforts to earn customers' loyalty, build trust, and strengthen the relationship. Also, there is often a high marginal cost for each communication, and sending offers to people who will not respond is unprofitable.
For example, some catalog companies spend $5 to send a catalog to a single customer, making it very important that those who receive these catalogs are likely to respond, or at least are highly valued customers. Scoring models can help make this decision. They use other information that a company has about a customer to predict whether or not the customer will be interested in the current offer. These models were originally developed by direct marketers, and an example from direct marketing should make the idea clear.

Suppose that a catalog company must decide which of its customers should receive its spring fashion catalog. It has the entire purchase history of its customers, beginning with variables such as recency (how recently the customer has purchased), frequency (number of times the customer has placed an order during some fixed time period), and monetary value (amount of money spent during a fixed time period). The catalog company sent a spring fashion catalog last year, and recorded which customers made purchases from it and how much they spent. The catalog company could build a scoring model using data from the previous year's spring fashion catalog. It would develop the model by using the purchase history it had prior to the mailing to predict who actually responded to the mailing. It would then apply the model to the current purchase history to estimate how likely each customer is to respond, and send the catalog to those who are most likely. The estimate of response likelihood or expected purchase amount is called a score, and applying the model to the current purchase history is called scoring the database.

In general, let the p × 1 vector x contain the information that a company has on a single customer to whom it is considering sending an offer. In the catalog example, x would contain the previous purchase history.
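The recency, frequency, and monetary value variables described above can be computed directly from an order history. The following is a minimal sketch using hypothetical, made-up order data (the field names and the 365-day window are illustrative assumptions, not from the paper):

```python
from collections import defaultdict

# Hypothetical order history: (customer_id, days_since_order, amount).
orders = [
    (1, 10, 60.0), (1, 200, 25.0),
    (2, 45, 80.0),
    (3, 5, 30.0), (3, 90, 55.0), (3, 400, 20.0),
]

WINDOW = 365  # fixed time period for frequency and monetary value

rfm = defaultdict(lambda: {"recency": float("inf"), "frequency": 0, "monetary": 0.0})
for cust, days_ago, amount in orders:
    row = rfm[cust]
    row["recency"] = min(row["recency"], days_ago)   # most recent purchase
    if days_ago <= WINDOW:                           # restrict to the fixed window
        row["frequency"] += 1                        # number of orders in window
        row["monetary"] += amount                    # amount spent in window

print(dict(rfm))
```

Each customer then contributes one row of predictor variables (recency, frequency, monetary) to the scoring model.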
Sometimes scoring models are used to prospect for new customers; in this case the company would not have as much information about the prospect, and x might contain only a set of overlaid demographics. The variables represented by x are called predictor variables. Let y be some measure of the customer's response. In the catalog example, y would usually be the amount the customer bought from the previous spring fashion catalog. In other cases, such as a theater company sending a mailing to sell subscriptions to its program, y is a dichotomous variable taking values "respond" (1) or "did not respond" (0). I shall refer to y as the dependent variable. A scoring model, f, tells how the average of y is related to x:

E(y|x) = f(x).     (1)

For a fixed value of x, the scoring model gives us the average of the response variable.

There has been a lot of research on modeling (1). The reason for this is that the financial gains from even small improvements can be great. Catalog companies routinely circulate millions of copies of their books. With a circulation of, say, two million, models yielding even a one-cent improvement in average order size are appreciated. The same is true in many other situations where direct marketing is used, such as credit card offers, solicitations from internet service providers, and solicitations from charities.

Many different functional forms have been proposed for f (e.g., see Shepard (1995, chs. 12, 13, 16)). Perhaps the functional form most commonly used by practitioners is to model y as a linear function of x, f(x) = x′β (Banslaben, 1992). Hansotia and Wang (1997) use logistic regression. Peltier, Davis, and Schibrowsky (1996) give an example of how discriminant analysis can be used in direct marketing. The model could also be a tree-based model fitted with, e.g., CHAID or CART (Magidson, 1988). Zahavi and Levin (1997) investigate neural networks. Bult (1993) investigates semi-parametric forms for f.
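The build-then-score workflow described above, using the linear form f(x) = x′β, can be sketched as follows. All data here are simulated stand-ins for a real purchase history, and the mailing depth of 100 customers is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data from last year's mailing: predictor matrix X
# (e.g. recency, frequency, monetary) and response y (amount purchased).
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([-0.5, 0.8, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Fit the linear scoring model f(x) = x'b by least squares.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Score the database": apply the model to the current purchase histories
# and mail to the customers with the highest scores.
X_current = rng.normal(size=(1000, p))
scores = X_current @ b
mail_to = np.argsort(scores)[::-1][:100]   # top 100 customers by score
```

The same two-step pattern (estimate on last year's outcome, score this year's database) applies whatever functional form is chosen for f.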
This paper investigates how scoring models can be improved by using an alternative method of estimation, shrinkage estimation. Shrinkage estimation has a long history in many fields, including engineering (e.g., Frank and Friedman, 1993), statistics (e.g., Neter et al., 1996), economics (e.g., Vinod and Ullah, 1981), machine learning and artificial intelligence (where it is called the weight-decay method; Rumelhart et al., 1996, pp. 552-3), and marketing (e.g., Mahajan, Jain, and Bergier, 1977; Saunders, 1987, pp. 12-14; Sharma and James, 1981). But shrinkage estimates are not well known in the direct marketing community. This paper summarizes the key properties of shrinkage estimates and shows how the performance of existing scoring models can be improved simply by reestimating them with a shrinkage method such as ridge regression (RR) or principal components regression (PCR). It also proposes a new interpretation of the RR estimates: they are shown to reduce the effective number of parameters in the model. The usual way to reduce the number of parameters is to drop variables from the model. When it is important to reduce the degrees of freedom (e.g., to prevent overfitting), this new interpretation implies that practitioners can reduce the effective number of parameters without having to drop seemingly important variables from the model. Shrinkage estimation is demonstrated on two direct marketing datasets.

2 Ridge regression

This section focuses on one particular type of shrinkage estimate, ridge regression (RR) for the linear regression model. Ridge estimates are also available for other models, including logistic regression (Schaefer et al., 1984). A second approach to shrinkage estimation, principal components regression (PCR), is also discussed briefly in this section. The section defines RR and develops two key properties of its estimates.
The first property, which is covered in most regression texts, is that ridge estimates are more reliable than ordinary least squares (OLS) estimates in that they have smaller mean squared error. This means that on average they will come closer to the true model parameters than the OLS estimates. Because of this property, RR is often applied to problems where there is a large amount of multicollinearity among the predictor variables and the OLS estimates are unstable. The second property is that the effective number of parameters in a model estimated with RR is smaller than the number of variables in the model. This is important because it gives the modeler a way of reducing the risk of overfitting the data without dropping variables, the approach currently favored by many direct marketers. This property is established by applying results from the smoothing literature to RR. For these two reasons, along with the fact that commercial software currently offers RR and PCR, these estimates could be attractive to direct marketing modelers.

Let X be an n × p matrix scaled so that the columns have mean zero and variance one. Each of the n rows of X contains p measurements on a customer. Assume that p < n and that the rank of X is p. Let y be an n × 1 vector, also scaled to have mean zero and variance one, containing measurements of the dependent variable, usually demand or customer value for direct marketing models. The linear regression model quantifies the relationship between X and y:

y = Xβ + e,

where β is a p × 1 vector of parameters that must be estimated from the data and e is a noise term with mean zero and variance σ².
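The two properties above can be illustrated numerically with the standard formulas: the ridge estimate b(k) = (X′X + kI)⁻¹X′y and, from the smoothing literature, the effective degrees of freedom df(k) = tr[X(X′X + kI)⁻¹X′]. The data and the ridge constant k below are arbitrary choices for illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Standardize X and y to mean zero, variance one, as in the text.
n, p = 100, 5
raw = rng.normal(size=(n, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)
y_raw = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
y = (y_raw - y_raw.mean()) / y_raw.std()

k = 10.0  # ridge constant, chosen arbitrarily for illustration

# Ridge estimate: b(k) = (X'X + kI)^{-1} X'y.
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Effective degrees of freedom: df(k) = tr[X (X'X + kI)^{-1} X'].
df = np.trace(X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T))

print(df)  # strictly less than p = 5; equals p when k = 0
```

For any k > 0 the trace is strictly smaller than p, which is the sense in which the ridge fit uses fewer effective parameters than the ostensible number of variables in the model.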
The ordinary least squares (OLS) estimate of β is the solution to the least-squares objective function:

b = argmin_b ||y − Xb||² = (X′X)⁻¹X′y.

2.1 Reducing the mean squared error of b

The overall quality of an estimate is usually measured by its mean squared error (MSE):

MSE(b) = E[(b − β)′(b − β)] = tr[V(b)] + bias(b)′bias(b).

The MSE of b tells us, on average, how far the estimate b is from the true value of the parameter β. It can be decomposed into a sum of two terms: the variance and the squared bias of the estimate. The OLS estimate b is an unbiased estimate of β, and therefore MSE(b) is simply the variance of b. The Gauss-Markov theorem tells us that b is the best linear unbiased estimate, meaning that it has the lowest variance and MSE among all linear unbiased estimates. But if we also consider biased estimates of β, we can do better. This is exactly what RR and PCR do: they modify b by introducing a bias, which
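The variance-bias trade-off behind the MSE decomposition can be seen in a small Monte Carlo experiment. The design below is deliberately collinear (two nearly identical predictors), the situation where OLS variance explodes; all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Highly collinear design: two predictors that are nearly identical.
n, p, k, reps = 50, 2, 5.0, 500
beta = np.array([1.0, 1.0])
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.05 * rng.normal(size=n)])

mse_ols = mse_ridge = 0.0
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)             # unbiased, high variance
    b_rr = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)  # biased, low variance
    mse_ols += np.sum((b_ols - beta) ** 2) / reps
    mse_ridge += np.sum((b_rr - beta) ** 2) / reps

print(mse_ols, mse_ridge)
```

With this degree of multicollinearity, the small bias introduced by the ridge constant k is more than repaid by the drop in variance, so the ridge MSE comes out far below the OLS MSE.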
References

[1] J. Schibrowsky, et al. Predicting payment and nonpayment of direct mail obligations: Profiling good and bad credit risks, 1996.
[2] V. Barnett, et al. Applied Linear Statistical Models, 1975.
[3] A. Hedges, et al. Predictive modelling—or is it?, 1991.
[4] J. Friedman, et al. A Statistical View of Some Chemometrics Regression Tools, 1993.
[5] Hrishikesh D. Vinod, et al. Recent Advances in Regression Methods, 1983.
[6] Yves Chauvin, et al. Backpropagation: the basic theory, 1995.
[7] Nissan Levin, et al. Applying neural computing to target marketing, 1997.
[8] J. Friedman, et al. Predicting Multivariate Responses in Multiple Linear Regression, 1997.
[9] Ronald A. Thisted, et al. Elements of statistical computing, 1986.
[10] Jay Magidson, et al. Improved statistical techniques for response modeling, 1988.
[11] J. Saunders. The Specification of Aggregate Market Models, 1987.
[12] Behram J. Hansotia, et al. Analytical challenges in customer acquisition, 1997.
[13] A. Winsor. Sampling techniques, 2000, Nursing times.
[14] Jan Roelf Bult, et al. Semiparametric versus Parametric Classification Models: An Application to Direct Marketing, 1993.
[15] Subhash Sharma, et al. Latent Root Regression: An Alternate Procedure for Estimating Parameters in the Presence of Multicollinearity, 1981.
[16] R. Schaefer, et al. A ridge logistic estimator, 1984.