Exploring Critical Aspects of CNN-based Keyword Spotting. A PHOCNet Study

Deep convolutional neural networks are now the baseline for a wide range of machine vision tasks, and keyword spotting is no exception. Many network architectures and learning strategies have been adapted from other vision tasks to build successful keyword spotting systems. In this paper, we argue that several details of this adaptation should be re-examined, with the aim of building stronger spotting models. In particular, we compare a pyramidal spatial pooling layer against a simpler approach, and show that a zoning strategy combined with fixed-size inputs can be just as effective while being less computationally expensive. We also examine the usefulness of augmentation, class balancing, and ensemble learning strategies, and propose an improved network. Our hypotheses are tested with numerical experiments on the IAM document collection, where the proposed network outperforms all other existing models.
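The contrast between zoning and pyramidal pooling can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `zone_pool` and `pyramid_pool`, the use of max pooling, and the grid sizes are all assumptions chosen for illustration. Zoning pools a fixed-size feature map over one grid of regions, while pyramidal pooling concatenates the results of several grids and so yields a longer descriptor at higher cost.

```python
import numpy as np

def zone_pool(features, rows, cols):
    """Max-pool a (C, H, W) feature map over a rows x cols grid of zones.

    Assumes H >= rows and W >= cols so that no zone is empty.
    Returns a vector of length C * rows * cols.
    """
    _, h, w = features.shape
    # Zone boundaries; linspace also copes with sizes not divisible by the grid.
    r_edges = np.linspace(0, h, rows + 1).astype(int)
    c_edges = np.linspace(0, w, cols + 1).astype(int)
    pooled = [
        features[:, r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]].max(axis=(1, 2))
        for i in range(rows)
        for j in range(cols)
    ]
    return np.concatenate(pooled)

def pyramid_pool(features, levels=((1, 1), (2, 2), (4, 4))):
    """Concatenate zone pooling at several grid resolutions (hypothetical levels)."""
    return np.concatenate([zone_pool(features, r, c) for r, c in levels])

# Toy fixed-size feature map: 2 channels, 4 x 6 spatial resolution.
fmap = np.arange(2 * 4 * 6, dtype=float).reshape(2, 4, 6)
zoned = zone_pool(fmap, rows=2, cols=3)   # single grid: 2 * 2 * 3 values
pyramid = pyramid_pool(fmap)              # three grids: 2 * (1 + 4 + 16) values
print(zoned.shape, pyramid.shape)
```

With a fixed input size the single-grid descriptor already has a fixed length, which is the property that a pyramidal layer is otherwise used to guarantee for variable-size inputs.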
