Machine learning methods for microbial source tracking

This paper reports the successful application of statistical and inductive learning methods to identify optimal discriminating parameters and to develop predictive models for determining the faecal sources of waters recently and heavily polluted with wastewaters (microbial source tracking). The data come from an international study in which various microbial and chemical parameters were measured in heavily polluted waters from diverse geographical areas. A total of 38 variables derived from these parameters were defined to characterise the 103 available observations. Four methods were evaluated: a Euclidean k-nearest-neighbour classifier, a linear Bayesian classifier, a quadratic Bayesian classifier and a support vector machine. The main aim was to obtain highly accurate predictive models using as few variables as possible. After a stringent feature selection process, predictive models using only two variables achieved 100% correct classification. These solutions use a linear combination of a discriminating tracer (the enumeration of phages infecting Bacteroides thetaiotaomicron) and a universal non-discriminating faecal indicator. Other models that do not use the discriminating tracer were also developed, but they needed more variables to achieve a high rate of correct classification.
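The idea of searching for a very small variable subset that still classifies perfectly can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic (not the study's 103 observations), the variable names `tracer` and `indicator` are hypothetical stand-ins, and the exhaustive search over pairs scored by leave-one-out Euclidean 1-nearest-neighbour accuracy is one simple wrapper-style procedure, not necessarily the feature selection process used in the paper.

```python
import itertools
import math
import random


def make_toy_data(n_per_class=20, seed=0):
    """Build a synthetic two-class data set (hypothetical, NOT the study's data).

    Columns: 0 = tracer, 1 = indicator, 2 and 3 = uninformative noise.
    The class is encoded only in the difference tracer - indicator, so the
    two classes overlap on either variable taken alone.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for label, offset in (("human", 1.0), ("animal", -1.0)):
        for k in range(n_per_class):
            indicator = 0.2 * k            # same range in both classes
            tracer = indicator + offset    # shifted up or down by class
            noise_a = rng.uniform(0.0, 4.0)
            noise_b = rng.uniform(0.0, 4.0)
            rows.append([tracer, indicator, noise_a, noise_b])
            labels.append(label)
    return rows, labels


def loo_accuracy_1nn(rows, labels, subset):
    """Leave-one-out accuracy of Euclidean 1-NN restricted to `subset`."""
    correct = 0
    for i, x in enumerate(rows):
        best_dist, best_label = math.inf, None
        for j, y in enumerate(rows):
            if i == j:
                continue  # the held-out observation cannot vote for itself
            dist = math.sqrt(sum((x[f] - y[f]) ** 2 for f in subset))
            if dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def best_subset_of_size(rows, labels, size):
    """Exhaustively score every subset of `size` variables; keep the best."""
    n_features = len(rows[0])
    best_subset, best_acc = None, -1.0
    for subset in itertools.combinations(range(n_features), size):
        acc = loo_accuracy_1nn(rows, labels, subset)
        if acc > best_acc:  # strict: the first subset reaching the top score wins
            best_subset, best_acc = subset, acc
    return best_subset, best_acc


rows, labels = make_toy_data()
subset, accuracy = best_subset_of_size(rows, labels, 2)
print("selected variables:", subset, "leave-one-out accuracy:", accuracy)
```

Exhaustive search is feasible here only because the toy problem has four variables; with 38 variables, the number of candidate subsets grows quickly with subset size, which is why greedy or otherwise guided selection strategies are commonly used instead.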
