Visual content-based web page categorization with deep transfer learning and metric learning

Abstract The growing volume of online multimedia content challenges current search, recommendation, and information retrieval systems. Visual elements carry information that is highly valuable in a range of web mining tasks. However, mining these resources is difficult because of the complexity and variability of images and the cost of collecting datasets large enough to train accurate deep learning models. This paper proposes a novel framework for categorizing web pages on the basis of their visual content. The framework jointly applies a transfer learning strategy and metric learning techniques to build a Deep Convolutional Neural Network (DCNN) for feature extraction, even when training data is scarce. The experimental results show that the proposed approach outperforms state-of-the-art handcrafted image descriptors and achieves high categorization accuracy. In addition, we address the problem of learning over time: the proposed framework can learn to identify new web page categories as new labeled images are provided at test time. As a result, prior knowledge of the complete set of possible web categories is not necessary in the initial training phase.
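
The abstract does not name the metric learning objective or the classifier used to add categories over time, so the following is only a minimal sketch under stated assumptions: a triplet margin loss as the metric learning objective, and a nearest-centroid rule over embeddings that a DCNN is assumed to have already produced. The names `triplet_loss` and `CentroidCategorizer` and the margin value are hypothetical and are not taken from the paper.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over a batch of (anchor, positive, negative) embeddings.

    Assumption: squared Euclidean distance and a fixed margin; the paper's
    actual objective may differ.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


class CentroidCategorizer:
    """Hypothetical nearest-centroid classifier over fixed image embeddings.

    New categories are registered by adding labeled embeddings; the
    feature extractor itself is not retrained.
    """

    def __init__(self):
        self.sums = {}    # label -> running sum of embeddings
        self.counts = {}  # label -> number of embeddings seen

    def add_examples(self, label, embeddings):
        embeddings = np.asarray(embeddings, dtype=float)
        self.sums[label] = self.sums.get(label, 0.0) + embeddings.sum(axis=0)
        self.counts[label] = self.counts.get(label, 0) + len(embeddings)

    def predict(self, embedding):
        embedding = np.asarray(embedding, dtype=float)
        # Return the label whose centroid is closest in Euclidean distance.
        return min(
            self.sums,
            key=lambda label: np.linalg.norm(
                embedding - self.sums[label] / self.counts[label]
            ),
        )
```

In this sketch, calling `add_examples` with a previously unseen label is all that is needed to make that category available to `predict`, which mirrors the idea that the full set of categories need not be known during initial training.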
