Methods of high-level data exploration that are robust to the noise found in microarray data are few and far between. Solutions that use all original features to derive cluster structure can be misleading, while those that rely on trivial feature selection can miss important characteristics. We present a method adapted from previous work in the field of geography (Guo et al., 2003) that relies on the conditional entropy between pairs of dimensions to uncover underlying, native cluster structure within a dataset. Applied to an artificially clustered data set, this method performed well, though it showed some sensitivity to multiplicative noise. When applied to gene expression data, the method produced a clear representation of the underlying data structure.

Introduction

Our standard microarray dataset consists of ns observations (samples, or experiments) on ng genes (dimensions), so the data matrix is ns rows by ng columns; ng is typically much larger than ns. Microarray data sets are, by their nature, high-dimensional, which complicates the analysis of results and restricts one's ability to perform initial, high-level data exploration (it is rather difficult to visually inspect 8,000 dimensions). Even if this were tenable, useful clusters that exist across all available dimensions are exceedingly rare.

There is no shortage of feature selection methods that attempt to prune the number of genes down to a 'best' set for visualizing the cluster structure within the data. This, however, presupposes a single 'best' collection of genes useful for visualizing all subspace clusters, which is not often the case (Getz et al., 2000). A more useful means of detecting subspace cluster structure would use only those genes relevant to the cluster structure of a particular subset of dimensions. It is this attribute that makes conditional entropy such a useful tool for high-level data analysis.
If entropy is the amount of information provided by the outcome of a random variable, then conditional entropy can be defined as the amount of information about the outcome of one random variable provided by the outcome of a second random variable. We can use this measure of shared information to explore a series of dimensions and the observations on those dimensions (Cheng et al., 1999). For any given data set about which, perhaps, we know very little, conditional entropy can be used to discover clusters of dimensions; this information can then inform and tailor downstream analyses to the structure inherent in the data set. Guo et al. (2003) use this technique to discover latent, unexpected clusters on dimensional subspaces. Their model data sets from the field consist of a limited number of dimensions (e.g., several different types of geological measurements) and a very large number of observations on those dimensions (e.g., in excess of ten thousand).
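The pairwise measure described above can be sketched in a few lines of Python. This is not the authors' implementation; it is a minimal illustration, assuming continuous measurements that are discretized into equal-width bins, of how a low conditional entropy H(Y | X) flags a pair of dimensions that share structure:

```python
import numpy as np

def conditional_entropy(x, y, bins=10):
    """Estimate H(Y | X) in bits from two continuous vectors by
    discretizing each into equal-width bins.

    Uses the identity H(Y | X) = H(X, Y) - H(X). A small value means
    knowing X tells us a lot about Y, i.e. the two dimensions share
    cluster structure.
    """
    # Joint distribution P(X, Y) over the binned values.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)  # marginal P(X)

    # Entropies, skipping zero-probability cells (0 log 0 := 0).
    nz_xy = p_xy[p_xy > 0]
    nz_x = p_x[p_x > 0]
    h_xy = -np.sum(nz_xy * np.log2(nz_xy))  # H(X, Y)
    h_x = -np.sum(nz_x * np.log2(nz_x))     # H(X)
    return h_xy - h_x

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
related = x + 0.1 * rng.normal(size=5000)  # strongly dependent on x
unrelated = rng.normal(size=5000)          # independent of x

# H(Y | X) is much lower for the dependent pair than the independent one.
print(conditional_entropy(x, related) < conditional_entropy(x, unrelated))
```

Applied over all pairs of dimensions, such a matrix of conditional entropies is what allows related subsets of dimensions to be grouped without committing to a single global feature selection. The bin count is a free parameter here; too few bins blur real structure, while too many leave the histogram too sparse to estimate reliably.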
References

[1] Yi Zhang et al. Entropy-based subspace clustering for mining numerical data. KDD '99, 1999.
[2] Mark Gahegan et al. Opening the black box: interactive hierarchical clustering for multivariate spatial patterns. GIS '02, 2002.
[3] G. Getz et al. Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences of the United States of America, 2000.
[4] J. Mesirov et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999.