Notation and Conventions

A dataset is a collection of d-tuples (a d-tuple is an ordered list of d elements). Tuples differ from vectors, because we can always add and subtract vectors, but we cannot necessarily add or subtract tuples. There are always N items in any dataset. There are always d elements in each tuple in a dataset; the number of elements is the same for every tuple in any given dataset. Sometimes we may not know the value of some elements in some tuples.

We use the same notation for a tuple and for a vector. Most of our data will be vectors. We write a vector in bold, so x could represent a vector or a tuple (the context will make it obvious which is intended). The entire dataset is {x}. When we need to refer to the i'th data item, we write x_i. Assume we have N data items, and we wish to make a new dataset out of them; we write the dataset made out of these items as {x_i} (the i is to suggest you are taking a set of items and making a dataset out of them). If we need to refer to the j'th component of a vector x_i, we will write x_i^(j) (notice this isn't in bold, because it is a component not a vector, and the j is in parentheses because it isn't a power). Vectors are always column vectors.

Terms:
• mean({x}) is the mean of the dataset {x} (definition 1, page 11).
• std({x}) is the standard deviation of the dataset {x} (definition 2, page 12).
• var({x}) is the variance of the dataset {x} (definition 3, page 16).
• median({x}) is the median of the dataset {x} (definition 4, page 17).
• percentile({x}, k) is the k% percentile of the dataset {x} (definition 5, page 18).
• iqr({x}) is the interquartile range of the dataset {x} (definition 7, page 18).
• {x̂} is the dataset {x}, transformed to standard coordinates (definition 8, page 23).
• corr({(x, y)}) is the correlation between two components x and y of a dataset (definition 1, page 51).
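To make the notation concrete, here is a minimal Python sketch of how the conventions above might map onto code. Everything in it is illustrative: the dataset values and the helper names (`item`, `component`, `column`, `standard_coordinates`) are hypothetical. It also assumes that the variance divides by N rather than N - 1, and that the correlation is the mean of the products of standard coordinates; the definitions cited above are authoritative on both points. The percentile and interquartile range are left out because conventions for them vary.

```python
import math
import statistics

# A hypothetical dataset {x} with N = 5 items, each a 2-tuple (d = 2).
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
N = len(dataset)      # number of items in the dataset
d = len(dataset[0])   # number of elements in each tuple


def item(data, i):
    """The i'th data item, counting from 1 as in the text."""
    return data[i - 1]


def component(data, i, j):
    """The j'th component of the i'th item, both counting from 1."""
    return data[i - 1][j - 1]


def column(data, j):
    """The 1D dataset formed by taking the j'th component of every item."""
    return [x[j - 1] for x in data]


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    # Assumption: average squared deviation from the mean, dividing by N.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def std(xs):
    return math.sqrt(var(xs))


def median(xs):
    return statistics.median(xs)


def standard_coordinates(xs):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(xs), std(xs)
    return [(x - m) / s for x in xs]


def corr(xs, ys):
    # Assumption: mean of the products of the standard coordinates.
    xh, yh = standard_coordinates(xs), standard_coordinates(ys)
    return mean([a * b for a, b in zip(xh, yh)])
```

For example, `column(dataset, 1)` gives the first component of every item, and `corr(column(dataset, 1), column(dataset, 2))` is very close to 1 here because the second component is exactly twice the first.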