Information retrieval in hydrochemical data using the latent semantic indexing approach

The latent semantic indexing (LSI) method was applied to retrieve similar samples (those with a similar composition) from a dataset of groundwater samples. The LSI procedure consisted of two steps: (i) reduction of the data dimensionality by principal component analysis (PCA) and (ii) calculation of the similarity between selected samples (queries) and the other samples. Similarity was expressed as the cosine similarity and as the Euclidean and Manhattan distances. Five queries were chosen to represent different sampling localities. The original data space of 14 variables measured in 95 groundwater samples was reduced to the three-dimensional space of the three largest principal components, which explained nearly 80% of the total variance. For each query, the five closest samples were evaluated. The LSI outputs were compared with retrievals in the orthogonal system of all variables transformed by PCA and in the system of standardized original variables. Most of these retrievals did not agree with the LSI ones, most likely because both systems still contained interfering data noise that had not been removed beforehand by dimensionality reduction. The LSI approach, which filters out this noise, was therefore considered a promising strategy for information retrieval in real hydrochemical data.
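
The two-step procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `lsi_retrieve`, its parameters, and the random 95 × 14 matrix standing in for the groundwater data are all hypothetical, and standardizing the variables before PCA is an assumption rather than something the abstract states.

```python
import numpy as np

def lsi_retrieve(X, query_idx, n_components=3, top_k=5):
    """Hypothetical helper: rank samples by similarity to one query sample
    in the space of the leading principal components."""
    X = np.asarray(X, dtype=float)

    # Assumption: standardize each variable (zero mean, unit variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step (i): PCA via SVD; rows of Vt are the principal axes.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T          # samples in the reduced space
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()

    # Step (ii): compare the query with every other sample.
    q = scores[query_idx]
    others = np.delete(np.arange(len(scores)), query_idx)
    P = scores[others]

    cosine = (P @ q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q))
    euclidean = np.linalg.norm(P - q, axis=1)
    manhattan = np.abs(P - q).sum(axis=1)

    # Larger cosine means more similar; smaller distance means more similar.
    return {
        "explained_variance": explained,
        "cosine": others[np.argsort(-cosine)[:top_k]],
        "euclidean": others[np.argsort(euclidean)[:top_k]],
        "manhattan": others[np.argsort(manhattan)[:top_k]],
    }

# Synthetic placeholder data with the same shape as the study's dataset
# (95 samples, 14 variables); the values themselves are random noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(95, 14))
result = lsi_retrieve(X_demo, query_idx=0)
print(result["cosine"], result["euclidean"], result["manhattan"])
```

Setting `n_components` to the full number of variables corresponds to retrieval in the complete PCA-transformed system, which the abstract uses as one of its comparison cases.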
