Simultaneous Localization and Segmentation of Fish Objects Using Multi-task CNN and Dense CRF

We propose a deep learning tool to localize fish objects in benthic underwater videos on a frame by frame basis. The deep network predicts fish object spatial coordinates and simultaneously segments the corresponding pixels of each fish object. The network follows a state of the art inception resnet v2 architecture that automatically generates informative features for object localization and mask segmentation tasks. Predicted masks are passed to dense Conditional Random Field (CRF) post-processing for contour and shape refinement. Unlike prior methods that rely on motion information to segment fish objects, our proposed method only requires RGB video frames to predict both box coordinates and object pixel masks. Independence from motion information makes our proposed model more robust to camera movements or jitters, and makes it more applicable to process underwater videos taken from unmanned water vehicles. We test the model in actual benthic underwater video frames taken from ten different sites. The proposed tool can segment fish objects despite wide camera movements, blurred underwater resolutions, and is robust to a wide variety of environments and fish species shapes.

[1]  Xiu Li,et al.  Fast accurate fish detection and recognition of underwater images with Fast R-CNN , 2015, OCEANS 2015 - MTS/IEEE Washington.

[2]  Alfonso B. Labao,et al.  Weakly-Labelled Semantic Segmentation of Fish Objects in Underwater Videos Using a Deep Residual Network , 2017, ACIIDS.

[3]  Gary R. Bradski,et al.  Learning OpenCV 3: Computer Vision in C++ with the OpenCV Library , 2016 .

[4]  Xiu Li,et al.  Accelerating fish detection and recognition by sharing CNNs with objectness learning , 2016, OCEANS 2016 - Shanghai.

[5]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[7]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[8]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Hervé Glotin,et al.  LifeCLEF: Multimedia life species identification , 2014, EMR@ICMR.

[10]  Robert B. Fisher,et al.  Detecting, Tracking and Counting Fish in Low Quality Unconstrained Underwater Videos , 2008, VISAPP.

[11]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Guosheng Lin,et al.  Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[15]  Alexandra Branzan Albu,et al.  Automatic fish counting system for noisy deep-sea videos , 2014, 2014 Oceans - St. John's.

[16]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[17]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.