Data Mining and Scientific Data

Abstract: Data mining, the discovery of previously unknown information from a large collection of individual data sources, is becoming increasingly popular for scientific data archives. We describe an approach to data mining that uses spatial, temporal, and type constraints to obtain a broad list of data that are potentially related to a data set of interest. Tree- and spline-based multivariate regression and classification techniques are then used to identify functional relationships between the data. Expert knowledge is used to constrain and guide the model building and evaluation process. We demonstrate the approach by identifying relationships between indicators in an Antarctic state of the environment reporting database. Analyses of the fuel usage of electrical generators and boilers at Australia's Davis station showed that fuel usage depends on air temperature and wind speed, in good accordance with known physical processes. The periodic haul-outs of large numbers of leopard seals on Macquarie Island were related to anomalies in regional sea ice cover and sea surface temperature.
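The first step described in the abstract, narrowing a catalogue to data that are potentially related to a data set of interest, can be pictured as an overlap filter. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `DataSet` record, its field names, the `candidate_data` function, and every catalogue entry (including coordinates and dates) are hypothetical and invented for illustration. Longitude wrap-around at 180 degrees is ignored for brevity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataSet:
    """Hypothetical catalogue record; all field names are illustrative."""
    name: str
    kind: str        # data type label, e.g. "meteorology"
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    start: date
    end: date


def overlaps(a_min, a_max, b_min, b_max):
    """True when the closed intervals [a_min, a_max] and [b_min, b_max] intersect."""
    return a_min <= b_max and b_min <= a_max


def candidate_data(target, catalogue, allowed_kinds, pad_deg=0.0):
    """Return catalogue entries that satisfy all three constraints.

    Type: the entry's kind is in allowed_kinds.
    Spatial: its bounding box intersects the target's box padded by pad_deg.
    Temporal: its date range intersects the target's date range.
    """
    selected = []
    for ds in catalogue:
        if ds is target or ds.kind not in allowed_kinds:
            continue
        lat_ok = overlaps(ds.lat_min, ds.lat_max,
                          target.lat_min - pad_deg, target.lat_max + pad_deg)
        lon_ok = overlaps(ds.lon_min, ds.lon_max,
                          target.lon_min - pad_deg, target.lon_max + pad_deg)
        time_ok = overlaps(ds.start, ds.end, target.start, target.end)
        if lat_ok and lon_ok and time_ok:
            selected.append(ds)
    return selected


# Invented example records; none of these values come from a real database.
fuel = DataSet("station_fuel_usage", "fuel_usage",
               -69.0, -68.0, 77.0, 79.0, date(1995, 1, 1), date(2002, 12, 31))
catalogue = [
    fuel,
    DataSet("air_temperature", "meteorology",
            -69.0, -68.0, 77.0, 79.0, date(1957, 1, 1), date(2003, 12, 31)),
    DataSet("old_wind_record", "meteorology",
            -69.0, -68.0, 77.0, 79.0, date(1960, 1, 1), date(1970, 12, 31)),
    DataSet("distant_sea_ice", "sea_ice",
            -60.0, -50.0, 150.0, 165.0, date(1979, 1, 1), date(2003, 12, 31)),
    DataSet("penguin_counts", "biology",
            -69.0, -68.0, 77.0, 79.0, date(1990, 1, 1), date(2003, 12, 31)),
]

related = candidate_data(fuel, catalogue, {"meteorology", "sea_ice"})
print([ds.name for ds in related])
```

The filter is deliberately permissive: it only produces the broad candidate list, and the regression and classification stage, guided by expert knowledge, is what decides which candidates actually carry a relationship.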
