Methods of training set construction: Towards improving performance for automated mesozooplankton image classification systems

Abstract: The correspondence between variation in the physico-chemical properties of the water column and the taxonomic composition of zooplankton communities represents an important indicator of long-term and broad-scale change in marine systems. Evaluating compositional change and relating it to various forms of perturbation demand routine taxonomic identification methods that can be applied rapidly and accurately. Traditional identification by human experts is accurate but very time-consuming. The application of automated image classification systems to plankton communities has emerged as a potential resolution to this limitation. The objective of this study was to evaluate how specific aspects of training set construction for the ZooScan system influenced our ability to relate variation in zooplankton taxonomic composition to variation in hydrographic properties in the East China Sea. Specifically, we compared the relative utility of zooplankton classifiers trained with the following: (i) water mass-specific versus global training sets; (ii) balanced versus imbalanced training sets. The classification performance (accuracy and precision) of water mass-specific classifiers tended to decline with environmental dissimilarity, suggesting water-mass specificity. However, similar classification performance was also achieved by training our system with samples representing all hydrographic sub-regions (i.e. a global classifier). After examining category-specific accuracy, we found that the two approaches performed equally because overall accuracy was determined mainly by the dominant taxa. This apparently high classification accuracy came at the expense of accurate classification of rare taxa. To explore the basis for such biased classification, we trained our global classifier with an equal amount of training data for each category (balanced training). We found that balanced training gave higher accuracy for rare taxa but low accuracy for abundant taxa.
The errors introduced in recognition still pose a major challenge for automatic classification systems. In order to fully automate analyses of zooplankton communities and relate variation in composition to hydrographic properties, the recognition power of the system requires further improvements.
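
The contrast drawn in the abstract between overall accuracy and category-specific accuracy, and the idea of a balanced training set, can be illustrated with a minimal Python sketch. This is not the ZooScan workflow or the authors' code: the function names, the category labels ("copepod", "rare_taxon") and the counts are hypothetical, and random subsampling to a fixed number of images per category is only one assumed way to balance a training set.

```python
import random
from collections import Counter, defaultdict


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all images whose predicted category matches the true one."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def per_category_accuracy(true_labels, predicted_labels):
    """For each true category, the fraction of its images labelled correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {cat: correct[cat] / totals[cat] for cat in totals}


def balanced_subsample(labels, per_category, seed=0):
    """Indices of a training subset with at most `per_category` images per category.

    Assumption: balancing is done by random downsampling without replacement.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)
    chosen = []
    for category in sorted(by_category):
        indices = by_category[category]
        chosen.extend(rng.sample(indices, min(per_category, len(indices))))
    return sorted(chosen)


# Hypothetical imbalanced sample: one dominant and one rare category.
true_labels = ["copepod"] * 95 + ["rare_taxon"] * 5
# A classifier that assigns every image to the dominant category.
predicted = ["copepod"] * 100

overall = overall_accuracy(true_labels, predicted)
per_cat = per_category_accuracy(true_labels, predicted)
balanced_idx = balanced_subsample(true_labels, per_category=5)
balanced_counts = Counter(true_labels[i] for i in balanced_idx)

print(overall)          # high, driven by the dominant category
print(per_cat)          # the rare category is never recognised
print(balanced_counts)  # equal number of images per category
```

In this toy case the overall accuracy looks high even though the rare category is never recognised, which is the bias the abstract describes. The balanced subset gives each category equal weight during training, but, as the abstract reports, this can lower accuracy for abundant taxa.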