Functional data analysis for environmental and biomedical problems

Data mining methods are useful tools for extracting implicit patterns from data, whether the datasets are large and complicated or small and scarce. Despite their great potential for complicated data, appropriate methods remain immature and insufficient. The major aim of this dissertation is to present data mining methods, illustrated with real data, as tools for analyzing the complex behavior of functional data. The first part presents a data mining application that (1) identifies an efficient way to characterize the spatial variations of PM2.5 concentrations based solely on their temporal patterns, and (2) analyzes the temporal and seasonal patterns of PM2.5 concentrations in spatially homogeneous regions. The study used 24-hour average PM2.5 concentrations measured every third day between 2001 and 2005 at 522 monitoring sites in the continental United States. A k-means clustering algorithm using the correlation distance was employed to investigate the similarity between the temporal profiles observed at the monitoring sites. The clustering produced six clusters of sites with distinct temporal patterns, which identified and characterized spatially homogeneous regions of the United States. The study also presents a rotated principal component analysis (RPCA), a method that has been used to characterize spatial patterns of air pollution, and discusses the differences between the clustering algorithm and RPCA. The following chapter presents a data mining application for investigating the behavior of ozone concentrations. Ozone is known to be associated with human health effects. Ozone data are generally collected over long periods at locations of interest. However, constructing ozone monitoring sites may not be possible or cost-effective because of limitations such as hazardous environments or inaccessible areas.
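The correlation-distance k-means clustering used in the PM2.5 study can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the function names `standardize_rows` and `correlation_kmeans` are hypothetical, the distance is taken as one minus the Pearson correlation between a site's temporal profile and a cluster centroid, centroids are updated as the mean of the standardized member profiles, and a deterministic farthest-first initialization stands in for whatever initialization the study used.

```python
import numpy as np


def standardize_rows(x):
    """Center each row and scale it to unit length (constant rows stay zero)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return centered / norms


def correlation_kmeans(profiles, k, n_iter=100):
    """Cluster rows of `profiles` using the distance 1 - Pearson correlation.

    Assumption: farthest-first initialization, chosen only to keep the
    sketch deterministic; it is not taken from the dissertation.
    """
    z = standardize_rows(profiles)

    # Farthest-first initialization: start from row 0, then repeatedly add
    # the profile that is least correlated with the centroids chosen so far.
    chosen = [0]
    for _ in range(1, k):
        d = (1.0 - z @ z[chosen].T).min(axis=1)
        chosen.append(int(d.argmax()))
    centroids = z[chosen].copy()

    labels = np.full(len(z), -1)
    for _ in range(n_iter):
        # Rows are zero-mean with unit norm, so a dot product with a
        # re-standardized centroid equals the Pearson correlation.
        dist = 1.0 - z @ standardize_rows(centroids).T
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            members = z[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Synthetic example: profiles that differ in level and amplitude but share
# a temporal shape should fall into the same cluster.
t = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
profiles = np.array([
    5 + 2 * np.sin(t), 12 + 7 * np.sin(t), 8 + 1 * np.sin(t),
    5 + 2 * np.cos(t), 20 + 4 * np.cos(t), 9 + 3 * np.cos(t),
])
labels, _ = correlation_kmeans(profiles, k=2)
```

Because the correlation distance ignores each site's mean level and amplitude, sites are grouped by the shape of their temporal profile rather than by how polluted they are, which matches the abstract's aim of characterizing regions solely from temporal patterns.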
The objectives of this study are (1) to interpolate ozone concentrations as a functional response at an unsampled location, and (2) to reduce model complexity by constructing a data compression and reduction model that retains as much accuracy as possible. The study used daily maximum 8-hour ozone concentrations recorded between 2003 and 2006 at 14 monitoring sites in the Dallas-Fort Worth area. Wavelet decomposition broke the data down into multiple scales, regression analysis served as the data compression method, and kriging provided the spatial interpolation. In addition, a model refining step helped tune the ozone concentrations for different levels of variability. The study shows that the model can achieve a mean absolute error (MAE) as low as 6.99 ppb and a mean absolute error on high-ozone days (MAE75) as low as 9.76 ppb. Finally, an efficient strategy for classifying prostate cancer from near-infrared spectra is illustrated. Prostate cancer is the most common cancer among men and the second leading cause of cancer death in the United States. The main purpose of this study is to develop an efficient tool that classifies near-infrared (NIR) spectroscopic data taken from ex vivo human prostate glands as normal or cancerous. The proposed procedure consists of several steps. First, to ensure comparability between spectra, each spectrum was normalized by dividing each spectral point by the area of the total intensity of the spectrum. Second, clustering analysis was performed on the normalized spectra to separate those that represent the normal pattern from a mixed group containing both normal and tumor spectra. Third, we conducted a two-stage classification: the first stage constructs a classification model using the labels obtained from the preceding clustering analysis, and the second stage focuses on the mixed group identified by the first-stage model.
To increase accuracy, the second classification model was built on selected features that capture important characteristics of the spectral data. The proposed procedure was evaluated by its classification performance on test samples using leave-one-out cross-validation, yielding an accuracy of 90%. (Abstract shortened by UMI.)
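Two steps of the spectral procedure above, area normalization and leave-one-out cross-validation, can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: the area of a spectrum is taken to be its trapezoidal integral over wavelength, the nearest-centroid classifier is a hypothetical stand-in for the unspecified classification models, and the names `normalize_by_area`, `nearest_centroid_predict`, and `leave_one_out_accuracy` are invented for this example. The toy data are synthetic.

```python
import numpy as np


def normalize_by_area(wavelengths, spectra):
    """Divide every spectrum (one per row) by its trapezoidal area.

    Assumption: "area of the total intensity" is read as the integral of
    intensity over wavelength.
    """
    wavelengths = np.asarray(wavelengths, dtype=float)
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    widths = np.diff(wavelengths)
    # Trapezoidal rule: mean of neighbouring intensities times bin width.
    areas = np.sum(widths * (spectra[:, 1:] + spectra[:, :-1]) / 2.0, axis=1)
    return spectra / areas[:, None]


def nearest_centroid_predict(train_x, train_y, x):
    """Hypothetical stand-in classifier: label of the closest class mean."""
    classes = np.unique(train_y)
    centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
    return classes[np.linalg.norm(centroids - x, axis=1).argmin()]


def leave_one_out_accuracy(x, y):
    """Hold out each sample once, train on the rest, and count correct labels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    hits = 0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        predicted = nearest_centroid_predict(x[keep], y[keep], x[i])
        hits += int(predicted == y[i])
    return hits / len(x)


# Synthetic toy spectra: rows 0-1 share one shape, rows 2-3 another,
# each pair differing only in overall intensity.
wavelengths = np.linspace(700.0, 1000.0, 4)
raw = np.array([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]])
tissue = np.array(["normal", "normal", "tumor", "tumor"])

normalized = normalize_by_area(wavelengths, raw)
accuracy = leave_one_out_accuracy(normalized, tissue)
```

Normalizing by area removes overall intensity differences, so spectra are compared by shape. Leave-one-out cross-validation suits small sample sets because every spectrum serves as a test case exactly once while the remaining spectra train the model.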
