The robust beauty of improper linear models in decision making
Robyn M. Dawes
American Psychologist, July 1979, Vol. 34, No. 7, 571-582

Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl's book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.

Work on this article was started at the University of Oregon and Decision Research, Inc., Eugene, Oregon; it was completed while I was a James McKeen Cattell Sabbatical Fellow at the Psychology Department at the University of Michigan and at the Research Center for Group Dynamics at the Institute for Social Research there. I thank all these institutions for their assistance, and I especially thank my friends at them who helped. This article is based in part on invited talks given at the American Psychological Association (August 1977), the University of Washington (February 1978), the Aachen Technological Institute (June 1978), the University of Groningen (June 1978), the University of Amsterdam (June 1978), the Institute for Social Research at the University of Michigan (September 1978), Miami University, Oxford, Ohio (November 1978), and the University of Chicago School of Business (January 1979). I received valuable feedback from most of the audiences. Requests for reprints should be sent to Robyn M. Dawes, Department of Psychology, University of Oregon, Eugene, Oregon 97403.

Paul Meehl's (1954) book Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence appeared 25 years ago. It reviewed studies indicating that the prediction of numerical criterion variables of psychological interest (e.g., faculty ratings of graduate students who had just obtained a PhD) from numerical predictor variables (e.g., scores on the Graduate Record Examination, grade point averages, ratings of letters of recommendation) is better done by a proper linear model than by the clinical intuition of people presumably skilled in such prediction. The point of this article is to review evidence that even improper linear models may be superior to clinical predictions.

A proper linear model is one in which the weights given to the predictor variables are chosen in such a way as to optimize the relationship between the prediction and the criterion. Simple regression analysis is the most common example of a proper linear model; the predictor variables are weighted in such a way as to maximize the correlation between the resulting weighted composite and the actual criterion. Discriminant function analysis is another example of a proper linear model; weights are given to the predictor variables in such a way that the resulting linear composites maximize the discrepancy between two or more groups. Ridge regression analysis, another example (Darlington, 1978; Marquardt & Snee, 1975), attempts to assign weights in such a way that the linear composites correlate maximally with the criterion of interest in a new set of data. Thus, there are many types of proper linear models, and they have been used in a variety of contexts. One example (Dawes, 1971) was presented in this Journal; it involved the prediction of faculty ratings of graduate students.
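Before turning to that example in detail, the following minimal sketch (in Python, with hypothetical data and invented numbers) illustrates what it means for the weights of a proper linear model to be chosen optimally; nothing in it is taken from the studies discussed here.

```python
# Minimal sketch of a "proper" linear model: weights are estimated by
# ordinary least squares, so the weighted composite has the maximum
# possible correlation with the criterion in the fitting sample.
# All data below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Three hypothetical numerical predictors (think GRE, GPA, selectivity).
X = rng.normal(size=(n, 3))
# A hypothetical numerical criterion (think later faculty rating).
criterion = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=1.0, size=n)

# Proper weights: the least-squares solution (with an intercept term).
X1 = np.column_stack([np.ones(n), X])
weights, *_ = np.linalg.lstsq(X1, criterion, rcond=None)

composite = X1 @ weights
r = np.corrcoef(composite, criterion)[0, 1]
print("weights:", np.round(weights[1:], 2))
print("correlation of composite with criterion:", round(r, 2))
```

Discriminant analysis and ridge regression optimize different criteria, but the logic is the same: the weights are tuned to the data.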
All graduate students at the University of Oregon's Psychology Department who had been admitted between the fall of 1964 and the fall of 1967—and who had not dropped out of the program for nonacademic reasons (e.g., psychosis or marriage)—were rated by the faculty in the spring of 1969; faculty members rated only students whom they felt comfortable rating. The following rating scale was used: 5, outstanding; 4, above average; 3, average; 2, below average; 1, dropped out of the program in academic difficulty. Such overall ratings constitute a psychologically interesting criterion because the subjective impressions of faculty members are the main determinants of the job (if any) a student obtains after leaving graduate school. A total of 111 students were in the sample; the number of faculty members rating each of these students ranged from 1 to 20, with the mean number being 5.67 and the median being 5. The ratings were reliable. (To determine the reliability, the ratings were subjected to a one-way analysis of variance in which each student being rated was regarded as a treatment. The resulting between-treatments variance ratio, η², was .67, and it was significant beyond the .001 level.) These faculty ratings were predicted from a proper linear model based on the student's Graduate Record Examination (GRE) score, the student's undergraduate grade point average (GPA), and a measure of the selectivity of the student's undergraduate institution. (This selectivity index was based on Cass and Birnbaum's 1968 ratings, given at the end of their book Comparative Guide to American Colleges; the verbal categories of selectivity were given numerical values according to the following rule: most selective, 6; highly selective, 5; very selective (+), 4; very selective, 3; selective, 2; not mentioned, 1.) The cross-validated multiple correlation between the faculty ratings and predictor variables was .38. Congruent with Meehl's results, the correlation of these latter faculty ratings with the average rating of the people on the admissions committee who selected the students was .19 (unfortunately, only 23 of the 111 students could be used in this comparison because the rating scale the admissions committee used changed slightly from year to year); that is, it accounted for one fourth as much variance (variance accounted for being the square of the correlation: (.19/.38)² = .25). This example is typical of those found in psychological research in this area in that (a) the correlation with the model's predictions is higher than the correlation with clinical prediction, but (b) both correlations are low. These characteristics often lead psychologists to interpret the findings as meaning that while the low correlation of the model indicates that linear modeling is deficient as a method, the even lower correlation of the judges indicates only that the wrong judges were used.
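The cross-validation mentioned above can be pictured with a small sketch: regression weights are estimated on one half of the sample, and the resulting composite is correlated with the criterion in the held-out half. The Python code below is a hypothetical illustration; the data, the particular split, and the printed numbers are invented and are not those of the Oregon study.

```python
# Sketch of split-half cross-validation of a proper linear model.
# Weights are fit on one half of a hypothetical sample and evaluated on
# the other half, which is where shrinkage from fitting-sample optimism
# shows up as a lower cross-validated correlation.
import numpy as np

rng = np.random.default_rng(1)
n = 111                       # sample size matching the example in the text
X = rng.normal(size=(n, 3))   # hypothetical standardized GRE, GPA, selectivity
criterion = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(scale=1.5, size=n)

half = n // 2
X_fit, y_fit = X[:half], criterion[:half]
X_val, y_val = X[half:], criterion[half:]

# Estimate regression weights on the first half ...
A_fit = np.column_stack([np.ones(half), X_fit])
w, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)

# ... then correlate the weighted composite with the criterion on the
# held-out half.
A_val = np.column_stack([np.ones(n - half), X_val])
r_cv = np.corrcoef(A_val @ w, y_val)[0, 1]
print("cross-validated correlation:", round(r_cv, 2))
```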
An improper linear model is one in which the weights are chosen by some nonoptimal method. They may be chosen to be equal, they may be chosen on the basis of the intuition of the person making the prediction, or they may be chosen at random. Nevertheless, improper models may have great utility. When, for example, the standardized GREs, GPAs, and selectivity indices in the previous example were weighted equally, the resulting linear composite correlated .48 with later faculty rating. Not only is the correlation of this linear composite higher than that with the clinical judgment of the admissions committee (.19), it is also higher than that obtained upon cross-validating the weights obtained from half the sample.

An example of an improper model that might be of somewhat more interest—at least to the general public—was motivated by a physician who was on a panel with me concerning predictive systems. Afterward, at the bar with his wife and me, he said that my paper might be of some interest to my colleagues, but success in graduate school in psychology was not of much general interest: "Could you, for example, use one of your improper linear models to predict how well my wife and I get along together?" he asked. I realized that I could—or might. At that time, the Psychology Department at the University of Oregon was engaged in sex research, most of which was behavioristically oriented. So the subjects of this research monitored when they made love, when they had fights, when they had social engagements (e.g., with in-laws), and so on. These subjects also made subjective ratings about how happy they were in their marital or coupled situation. I immediately thought of an improper linear model to predict self-ratings of marital happiness: rate of lovemaking minus rate of fighting. My colleague John Howard had collected just such data on couples when he was an undergraduate at the University of Missouri—Kansas City, where he worked with Alexander (1971). After establishing the intercouple reliability of judgments of lovemaking and fighting, Alexander had one partner from each of 42 couples monitor these events. She allowed us to analyze her data, with the following results: "In the thirty happily married
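The unit-weighting idea described above is easy to state in code. The sketch below, in Python with hypothetical data, standardizes each predictor and adds the results with equal weights (a weight of -1 would be used for a predictor negatively related to the criterion, as in the rate-of-fighting term of the marital-happiness model); the numbers it prints are illustrative only and are not taken from the studies discussed here.

```python
# Sketch of an "improper" unit-weighted linear model: standardize each
# predictor and add them with equal (here, +1) weights. Hypothetical data.
import numpy as np

rng = np.random.default_rng(2)
n = 111
X = rng.normal(size=(n, 3))   # hypothetical GRE, GPA, selectivity scores
criterion = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(scale=1.5, size=n)

# Standardize each predictor, then weight them equally.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
unit_composite = Z.sum(axis=1)

r_unit = np.corrcoef(unit_composite, criterion)[0, 1]
print("correlation of unit-weighted composite with criterion:", round(r_unit, 2))

# The two-variable marital-happiness model in the text has the same form:
# happiness_score = (rate of lovemaking) - (rate of fighting),
# i.e., unit weights of +1 and -1 on the two monitored rates.
```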

[1]  H. A. Wallace, et al. What is in the Corn Judge's Mind? 1923.

[2]  S. S. Wilks. Weighting systems for linear functions of correlated variables when there is no dependent variable. 1938.

[3]  Paul E. Meehl. Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. 1996.

[4]  K. R. Hammond. Probabilistic functioning and the clinical method. Psychological Review, 1955.

[5]  P. Hoffman. The paramorphic representation of clinical judgment. Psychological Bulletin, 1960.

[6]  D. B. Yntema, et al. Man-Computer Cooperation in Decisions Requiring Common Sense. 1961.

[7]  E. H. Bowman. Consistency and Optimality in Managerial Decision Making. 1963.

[8]  L. R. Goldberg, et al. Diagnosticians vs. diagnostic signs: The diagnosis of psychosis vs. neurosis from the MMPI. Psychological Monographs, 1965.

[9]  Paul E. Meehl. Seer over sign: The first good example. 1965.

[10]  J. Sawyer, et al. Measurement and prediction, clinical and statistical. Psychological Bulletin, 1966.

[11]  W. Beaver. Financial Ratios as Predictors of Failure. 1966.

[12]  R. Holt. Yet another look at clinical and statistical prediction: or, is clinical psychology worthwhile? American Psychologist, 1970.

[13]  Lewis R. Goldberg. Man versus model of man: A rationale, plus some evidence, for a method of improving on clinical inferences. 1970.

[14]  Paul Slovic, et al. Comparison of Bayesian and Regression Approaches to the Study of Information Processing in Judgment. 1971.

[15]  N. Wiggins, et al. Man versus model of man revisited: The forecasting of graduate school success. 1971.

[16]  R. Dawes. A case study of graduate admissions: Application of three principles of human decision making. 1971.

[17]  Frank L. Schmidt, et al. The Relative Efficiency of Regression and Simple Unit Predictor Weights in Applied Differential Psychology. 1971.

[18]  John G. Claudy. A Comparison of Five Variable Weighting Procedures. 1972.

[19]  D. Krantz. Measurement Structures and Psychological Laws. Science, 1972.

[20]  Hillel J. Einhorn, et al. Expert measurement and mechanical combination. 1972.

[21]  Lewis R. Goldberg, et al. Parameters of personality inventory construction and utilization: A comparison of prediction strategies and tactics. 1972.

[22]  E. Deakin. Discriminant Analysis of Predictors of Business Failure. 1972.

[23]  R. Dawes, et al. Linear models in decision making. 1974.

[24]  A. Tversky, et al. Judgment under Uncertainty: Heuristics and Biases. Science, 1974.

[25]  Ward Edwards, et al. Public Values: Multiattribute-Utility Measurement for Social Decision Making. 1975.

[26]  R. M. Dawes. Graduate admission variables and future success. Science, 1975.

[27]  R. Snee, et al. Ridge Regression in Practice. 1975.

[28]  R. Hogarth, et al. Unit weighting schemes for decision making. 1975.

[29]  K. R. Hammond, et al. Science, values, and human judgment. Science, 1976.

[30]  L. R. Goldberg. Man versus model of man: Just how conflicting is that evidence? 1976.

[31]  W. Edwards. How to Use Multi-Attribute Utility Measurement for Social Decision Making. 1976.

[32]  Three Steps Toward Robust Regression. 1976.

[33]  Robert Libby. Man versus model of man: Some conflicting evidence. 1976.

[34]  R. Dawes, et al. Linear Prediction of Marital Happiness. 1976.

[35]  H. Wainer, et al. Three steps towards robust regression. 1976.

[36]  Howard Wainer. Estimating Coefficients in Linear Models: It Don't Make No Nevermind. 1976.

[37]  J. S. Edwards, et al. Marriage: Direct and continuous measurement. 1977.

[38]  Toward a Linear Prediction Model of Marital Happiness. 1977.

[39]  Ward Edwards. Technology for Director Dubious: Evaluation and Decision in Public Contexts. 1977.

[40]  B. Green. Parameter Sensitivity in Multivariate Methods. Multivariate Behavioral Research, 1977.

[41]  R. B. Darlington. Reduced-variance regression. Psychological Bulletin, 1978.

[42]  R. Hogarth, et al. Confidence in judgment: Persistence of the illusion of validity. 1978.

[43]  W. E. Remus, et al. Unit and Random Linear Models in Decision Making. Multivariate Behavioral Research, 1978.