Editorial: Understanding and Using the Power of Outliers in Statistical Modeling
The purpose of this editorial is to clear up misconceptions about outliers and their usefulness in statistical modeling. In many cases, outliers provide much-needed insight into the actual relationships that influence the data being modeled. They are particularly useful in modeling consumer behavior, where abnormalities among individual responses are more common than in other data sets. Unfortunately, many people who lack statistical acumen discard such abnormalities, thinking they are irrelevant and add no real value. They proceed to average them out, believing that this improves the structural integrity of the data being modeled. In fact, they are discarding valuable information that could uncover insights that are otherwise lost through elimination. Hopefully, this editorial will put to rest any questions one may have about outliers and how they are, or should be, treated in statistical modeling. This may become the standard reference on outliers in the future.
WHAT IS CONSIDERED AN OUTLIER?
An outlier is an unusual value, either too low or too high. For example, in one period sales may rise sharply because the company received a one-time large order from Saudi Arabia, because of a strike at a competitor's plant, or for some other reason. In another period, sales may fall sharply because of a riot, a hurricane, or a strike at the company's own plant. People who are not statistically oriented may not see that these outliers provide key insights into the data. They often assume that all outliers are the same and must be discarded because they distort the results of the analysis. True outliers, in most cases, are the result of data entry errors or coding mistakes. These outliers are classified as problematic, meaning they are not representative of the population and run counter to the objectives of the analysis.
Problematic outliers can seriously distort the statistical results of the model, thereby affecting the structural integrity of the statistical parameters. They should be identified in the data cleansing phase; if they are overlooked at that stage, they should be eliminated or recoded as missing values when they are later detected. Unfortunately, most inexperienced users of statistical modeling categorize all outliers as problematic and believe that they should all be treated this way (i.e., eliminated from the data completely). They also assume that all experienced statisticians follow the same thought process, that is, eliminate outliers.
The truth of the matter is that almost all outliers fall into the category of extraordinary events (or responses), which experienced statisticians refer to as "influential observations." In statistical modeling, one or more observations sometimes have a strong influence on the results. Since influential observations can have such a dramatic effect on the estimated regression equation, it is important that they are examined carefully with several techniques: scatter plots and statistical diagnostics. The first check is to make sure that no error has been made in collecting or recording the data. If an error has occurred, it can be corrected and a new estimated regression equation developed. On the other hand, if the observation(s) is valid, we should consider ourselves fortunate. Such a point, if valid, can contribute to a better understanding of the appropriate model and can lead to a better estimated regression equation. Thus, one should never throw out an outlier (influential observation) without further statistical analysis.
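As an illustration of the kind of statistical diagnostic mentioned above, the following is a minimal sketch, not a method taken from this editorial. It computes leverage and Cook's distance for a simple linear regression in plain Python. The function names, the `spend` and `sales` figures, and the 4/n cutoff (a common rule of thumb) are all assumptions made for the example. Flagged observations are only marked for closer examination; nothing is removed from the data.

```python
from statistics import mean


def influence_diagnostics(x, y):
    """Return (leverage, cooks_distance) for the fit y = b0 + b1 * x."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    # Ordinary least squares estimates
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)

    # Leverage: how far each x value lies from the centre of the x data
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Cook's distance combines residual size and leverage
    cooks = [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks


def flag_influential(x, y, threshold=None):
    """Return indices whose Cook's distance exceeds the threshold.

    The default threshold of 4/n is a rule of thumb (an assumption here),
    not a formal test. Flagged points are candidates for examination.
    """
    _, cooks = influence_diagnostics(x, y)
    if threshold is None:
        threshold = 4 / len(x)
    return [i for i, d in enumerate(cooks) if d > threshold]


# Hypothetical data: ten periods of spend and sales, with an unusually
# large sales figure in the final period (e.g. a one-time large order).
spend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sales = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 35.0]

flagged = flag_influential(spend, sales)
print("Observations to examine:", flagged)
```

A flagged index is a prompt to check first for a recording error and then, if the value is valid, to ask what it reveals about the model, in keeping with the sequence of checks described above.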
It is the belief of most statisticians, including myself, that influential observations (outliers) should never be discarded or averaged out of the data being examined. Furthermore, scatter plots alone are not enough, because many influential observations cannot be detected by viewing such graphs; they may be buried in the data. …