Basic Statistical Principles and Diagnostic Tree

No one is born a data miner. To develop expertise as a data miner or information analyst, you first need certain basic knowledge. You then need data to mine, along with a way to measure the important characteristics of a process or phenomenon, so that you can apply the appropriate statistical tools. Measuring doesn’t necessarily mean using a ruler, calipers, or a scale; it can be as simple as a “yes” or “no” decision. Statisticians and data miners typically categorize data as follows:

Variables data represent actual measured quantities, such as weights, dimensions, temperatures, and proportions. The measurements typically have units associated with them (for example, inches, pounds, degrees Fahrenheit, or centimeters). Because variables data may take on any value within a certain range (subject to the precision of the measuring instrument), these observations are said to be continuous.

Attributes data, on the other hand, represent either the classification of observations into one of two categories (such as “defective” or “nondefective”) or a count of the occurrences of some phenomenon (such as the number of airplanes that arrive at an airport each hour). In most cases, these observations can assume only integer values, so attributes data are said to be discrete.
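The distinction between variables data and attributes data can be illustrated with a minimal Python sketch. The function name classify_measurements and the sample values below are hypothetical, chosen only for illustration. The sketch assumes that the way a value is recorded reflects how it was measured: yes/no flags and whole-number counts are treated as attributes data, and real-valued readings as variables data. That assumption is a simplification. A weight rounded to whole pounds would be labeled as attributes data here even though the underlying quantity is continuous, so the analyst's knowledge of the measurement process should take precedence over this kind of check.

```python
from numbers import Integral, Real


def classify_measurements(values):
    """Label a sample as 'attributes' (discrete) or 'variables' (continuous).

    Heuristic only: it looks at how the values are recorded, not at how
    they were actually measured.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one observation")
    # Booleans (yes/no decisions) and integers (counts) are discrete.
    # In Python, bool is a subclass of int, so Integral covers both.
    if all(isinstance(v, Integral) for v in values):
        return "attributes"
    # Any other real-valued sample is treated as continuous.
    if all(isinstance(v, Real) for v in values):
        return "variables"
    raise TypeError("observations must be numeric or boolean")


# Hypothetical samples for illustration.
weights_lb = [12.4, 11.9, 12.1]    # measured quantity with units
defective = [True, False, False]   # two-category classification
arrivals_per_hour = [14, 9, 17]    # count of occurrences

print(classify_measurements(weights_lb))
print(classify_measurements(defective))
print(classify_measurements(arrivals_per_hour))
```

The label matters in practice because it guides which statistical tools are appropriate for the sample.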