Breaking the curse of dimensionality in quadratic discriminant analysis models with a novel variant of a Bayes classifier enhances automated taxa identification of freshwater macroinvertebrates

Macroinvertebrate samples are commonly used in biomonitoring to study changes in aquatic ecosystems. Traditionally, specimens are identified to taxon manually by human experts, which is time‐consuming and costly. Using image data of 35 taxa and 64 features, we propose a novel variant of quadratic discriminant analysis that breaks the curse of dimensionality in quadratic discriminant analysis models. Our variant, called a random Bayes array (RBA), uses bagging and random feature selection similar to random forest. We explore several variations of RBA. We consider three classification (i.e., taxa identification) decisions: majority vote, averaged posterior probabilities, and a novel approach, a score of weighted votes. Besides modifying the voting, we propose weighting features according to their importance instead of eliminating the least important features. We compared the performance of RBA with traditional Bayesian and several other popular classification methods and assessed how the methods behave in relation to each other and across the different macroinvertebrate species. Further, we investigated how severely misclassifications affect the performance of the different methods in a biomonitoring context. We found that the lowest and least severe classification error (i.e., most accurate taxa identification) was achieved with RBA using averaged posterior probabilities and weighted features. Copyright © 2013 John Wiley & Sons, Ltd.
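
The ensemble mechanics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each ensemble member is trained on a bootstrap sample restricted to a random feature subset, and that each member returns a vector of posterior probabilities over taxa. The function names, the subset size of 8 features, and the toy posteriors are hypothetical. Only two of the three decision rules are shown, because the abstract does not define how the score of weighted votes is computed.

```python
import random
from collections import Counter


def draw_member_view(n_samples, n_features, n_sub, rng):
    """Pick the data one ensemble member sees (hypothetical helper).

    Returns bootstrap row indices (sampled with replacement) and a
    random feature subset (sampled without replacement).
    """
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    cols = sorted(rng.sample(range(n_features), n_sub))
    return rows, cols


def majority_vote(posteriors):
    """Each member votes for its most probable class; the most voted class wins."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in posteriors)
    return votes.most_common(1)[0][0]


def averaged_posterior(posteriors):
    """Average the members' posterior vectors, then take the most probable class."""
    n_classes = len(posteriors[0])
    mean = [sum(p[k] for p in posteriors) / len(posteriors) for k in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Toy posteriors from three members over three classes (made-up numbers).
toy_posteriors = [
    [0.40, 0.35, 0.25],  # weakly prefers class 0
    [0.45, 0.40, 0.15],  # weakly prefers class 0
    [0.05, 0.90, 0.05],  # strongly prefers class 1
]

rng = random.Random(0)
rows, cols = draw_member_view(n_samples=10, n_features=64, n_sub=8, rng=rng)

print(majority_vote(toy_posteriors))       # two of three members vote for class 0
print(averaged_posterior(toy_posteriors))  # class 1 has the highest mean posterior
```

In this toy case the two rules disagree: two weakly confident members outvote one strongly confident member under majority vote, while averaging the posteriors lets the confident member dominate.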
