Benchmarking Deep Learning Models for Classification of Book Covers

Book covers usually provide a good depiction of a book’s content and its central idea. The classification of books in their respective genre usually involves subjectivity and contextuality. Book retrieval systems would utterly benefit from an automated framework that is able to classify a book’s genre based on an image, specifically for archival documents where digitization of the complete book for the purpose of indexing is an expensive task. While various modalities are available (e.g., cover, title, author, abstract), benchmarking the image-based classification systems based on minimal information is a particularly exciting field due to the recent advancements in the domain of image-based deep learning and its applicability. For that purpose, a natural question arises regarding the plausibility of solving the problem of book classification by only utilizing an image of its cover along with the current state-of-the-art deep learning models. To answer this question, this paper makes a three-fold contribution. First, the publicly available book cover dataset comprising of 57k book covers belonging to 30 different categories is thoroughly analyzed and corrected. Second, it benchmarks the performance on a battery of state-of-the-art image classification models for the task of book cover classification. Third, it uses explicit attention mechanisms to identify the regions that the network focused on in order to make the prediction. All of our evaluations were performed on a subset of the mentioned public book cover dataset. Analysis of the results revealed the inefficacy of the most powerful models for solving the classification task. With the obtained results, it is evident that significant efforts need to be devoted in order to solve this image-based classification task to a satisfactory level.

[1]  Quoc V. Le,et al.  AutoAugment: Learning Augmentation Policies from Data , 2018, ArXiv.

[2]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Marek Kozlowski,et al.  Reading Book by the Cover - Book Genre Detection Using Short Descriptions , 2017, ICMMI.

[4]  Andrew Y. Ng,et al.  Reading Digits in Natural Images with Unsupervised Feature Learning , 2011 .

[5]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[7]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[8]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[9]  Douglas Turnbull,et al.  You Can Judge an Artist by an Album Cover: Using Images for Music Annotation , 2011, IEEE MultiMedia.

[10]  Liqing Zhang,et al.  Saliency Detection: A Spectral Residual Approach , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[12]  Xavier Serra,et al.  Multimodal Deep Learning for Music Genre Classification , 2018, Trans. Int. Soc. Music. Inf. Retr..

[13]  Trevor Darrell,et al.  Recognizing Image Style , 2013, BMVC.

[14]  Pau Rodríguez,et al.  A Painless Attention Mechanism for Convolutional Neural Networks , 2018 .

[15]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Xavier Serra,et al.  Multi-Label Music Genre Classification from Audio, Text and Images Using Deep Features , 2017, ISMIR.

[19]  Gang Sun,et al.  Squeeze-and-Excitation Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20]  Bryan Pardo,et al.  Classifying paintings by artistic genre: An analysis of features & classifiers , 2009, 2009 IEEE International Workshop on Multimedia Signal Processing.

[21]  Seiichi Uchida,et al.  How do Convolutional Neural Networks Learn Design? , 2018, 2018 24th International Conference on Pattern Recognition (ICPR).

[22]  Marcus Liwicki,et al.  Deepdocclassifier: Document classification with deep Convolutional Neural Network , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[23]  Yinda Zhang,et al.  LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop , 2015, ArXiv.

[24]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[25]  S. Sastry,et al.  Cross-Entropy Loss Leads To Poor Margins , 2018 .

[26]  Quoc V. Le,et al.  AutoAugment: Learning Augmentation Strategies From Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  George Papandreou,et al.  Rethinking Atrous Convolution for Semantic Image Segmentation , 2017, ArXiv.

[28]  Seiichi Uchida,et al.  Judging a Book By its Cover , 2016, ArXiv.

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Tomas Mikolov,et al.  Bag of Tricks for Efficient Text Classification , 2016, EACL.

[31]  Marek Kozlowski,et al.  Deep Learning Approaches towards Book Covers Classification , 2018, ICPRAM.

[32]  Alexander Binder,et al.  On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation , 2015, PloS one.

[33]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[34]  Tara N. Sainath,et al.  State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Samy Bengio,et al.  Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model , 2017, ArXiv.

[36]  Xiaojun Wan,et al.  Recent advances in document summarization , 2017, Knowledge and Information Systems.

[37]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[38]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[39]  Vijay Vasudevan,et al.  Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.