COMBINING MULTIPLE MACHINE LEARNING ALGORITHMS TO PREDICT TAXA UNDER REFERENCE CONDITIONS FOR STREAMS BIOASSESSMENT

In the present study, we tested the potential of combining three machine learning techniques in a bioassessment tool to more accurately predict the pool of expected taxa at a site. This tool, the Hydra, uses the best performing technique among Support Vector Machines (SVM), Multi-layer Perceptron (MLP) and K-Nearest Neighbour (KNN) to predict the taxa expected at a stream site, and further evaluates the quality of a site through a classification system based on observed/expected (OE) values, similar to that used in River Invertebrate Prediction and Classification System (RIVPACS) models. To test the procedure, we used a dataset composed of 137 training sites, 15 validation sites and 174 test sites (potentially disturbed) from Portuguese streams. The combined use of the three machine learning techniques was more effective in predicting invertebrate taxa at a site than their individual use. All three methods were tested for every invertebrate taxon, and of the three, SVM and KNN were most often the best performing techniques (the most accurate among the three for a higher number of taxa) with the present dataset. The combination of all algorithms implemented in Hydra resulted in good models for stream bioassessment (e.g. SD OE50   0.6, Spearman correlations with global degradation >0.7). We also found no advantage in removing rare taxa from the training dataset, and found that 50% accuracy is the most adequate accuracy level for the calculation of OE ratios through Hydra. Future work should compare the performance of this technique with others, such as the RIVPACS models, using the same datasets. Considering the flexibility of this technique, its self-adjustment and its easy implementation through a website (aquaweb.uc.pt), we expect it also to be useful in the prediction of other aquatic elements such as fishes and algae. Copyright © 2013 John Wiley & Sons, Ltd.
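The selection-and-scoring idea described above can be illustrated with a minimal sketch. It is not the Hydra implementation: the function names, the taxon names, the accuracy values and the exact way the 50% accuracy level enters the OE ratio are all assumptions made for illustration. Here, each taxon is assigned the technique with the highest validation accuracy, and a taxon counts as expected at a site only if that technique reaches the accuracy level and predicts the taxon as present.

```python
# Illustrative sketch only; all names and numbers below are hypothetical.

def pick_best_models(val_accuracy):
    """For each taxon, keep the technique with the highest validation accuracy."""
    best = {}
    for taxon, scores in val_accuracy.items():
        name = max(scores, key=scores.get)  # ties resolve to the first technique listed
        best[taxon] = (name, scores[name])
    return best


def oe_ratio(observed, predicted_present, best, min_accuracy=0.5):
    """Observed/expected ratio for one site.

    Assumption: a taxon is 'expected' when its best technique has a validation
    accuracy of at least min_accuracy and predicts the taxon as present.
    Returns None when no taxon is expected.
    """
    expected = {
        taxon
        for taxon, (model, accuracy) in best.items()
        if accuracy >= min_accuracy and predicted_present[taxon][model]
    }
    if not expected:
        return None
    return len(expected & observed) / len(expected)


# Hypothetical validation accuracies per taxon and technique.
val_accuracy = {
    "Baetidae": {"SVM": 0.80, "MLP": 0.70, "KNN": 0.75},
    "Chironomidae": {"SVM": 0.60, "MLP": 0.65, "KNN": 0.90},
    "Leuctridae": {"SVM": 0.40, "MLP": 0.45, "KNN": 0.42},
}

# Hypothetical presence predictions of each technique at one test site.
predicted_present = {
    "Baetidae": {"SVM": True, "MLP": True, "KNN": False},
    "Chironomidae": {"SVM": False, "MLP": True, "KNN": True},
    "Leuctridae": {"SVM": True, "MLP": True, "KNN": True},
}

best = pick_best_models(val_accuracy)
site_oe = oe_ratio({"Chironomidae"}, predicted_present, best)
print(best)
print(site_oe)
```

In this toy case Leuctridae is excluded because none of its techniques reaches the 50% level, so two taxa are expected and one of them is observed. Values well below 1 would suggest that taxa expected under reference conditions are missing from the site.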
