Mining scientific data

The scientist at the other end of today’s data collection machinery— whether a satellite collecting data from a remote sensing platform, a telescope scanning the skies, or a microscope probing the minute details of a cell—is typically faced with the problem: What do I do with all the data? Scientific instruments can easily generate terabytes and petabytes of data, at rates as high as gigabytes per hour. There is a rapidly widening gap between data collection capabilities and the ability to analyze the data. The traditional approach of a lone investigator staring at raw data in pursuit of (often hypothesized) phenomena or underlying structure is quickly becoming infeasible.

The root of the problem is that data size and dimensionality are too large. A scientist can work effectively with a few thousand observations, each having a small number of measurements, say five. Effectively digesting millions of data points, each with tens or hundreds of measurements, is another matter. When a problem is fully understood and the scientist knows what to look for in the data through well-defined procedures, data volume can be handled effectively through data reduction.1 By reducing data, a scientist is

Usama Fayyad,
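The well-understood kind of data reduction mentioned above can be sketched as a projection of high-dimensional measurements onto a few directions of greatest variance (principal components). The array shapes and the choice of PCA here are illustrative assumptions, not a method prescribed by the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for instrument output: 100,000 observations,
# each with 50 measurements (hypothetical sizes for illustration).
data = rng.normal(size=(100_000, 50))

def reduce_dimensionality(X, k):
    """Project observations onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each measurement
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T                    # keep k high-variance directions

# A scientist can digest 5 measurements per point far more easily than 50.
reduced = reduce_dimensionality(data, k=5)
print(reduced.shape)  # (100000, 5)
```

The reduced array preserves most of the variance in the original measurements while shrinking each observation to a handful of numbers a person can plot and inspect.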