A Web Page Classifier Library Based on Random Image Content Analysis Using Deep Learning

In this paper we present a methodology, and a corresponding Python library, for the classification of webpages. The method retrieves a fixed number of images from a given webpage and, based on them, assigns the webpage to one of a set of predefined classes with an associated probability. The library trains a random forest model on features extracted from the images by a pre-trained neural network. The implementation is tested by recognizing weapon-class webpages in a curated list of 3859 websites. The results show that the best way to classify a webpage is to take the maximum, across all retrieved images, of the probability that an image belongs to the class of interest (here, weapons), and to assign the webpage to that class when this maximum exceeds a threshold. Our findings could have an important impact on the treatment of internet addictions.
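The page-level decision rule described in the abstract can be illustrated with a minimal sketch. It assumes the per-image class probabilities have already been produced by the image classifier. The function names, the example probabilities, and the 0.5 threshold are hypothetical and are not taken from the library. A mean-based rule is included only for contrast.

```python
from typing import Sequence


def classify_page_by_max(image_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the page if the highest per-image probability exceeds the threshold.

    image_probs holds, for each retrieved image, the probability that the
    image belongs to the class of interest (hypothetical input format).
    """
    if not image_probs:
        return False  # no images retrieved: nothing to base a decision on
    return max(image_probs) > threshold


def classify_page_by_mean(image_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Alternative rule for comparison: threshold the average probability."""
    if not image_probs:
        return False
    return sum(image_probs) / len(image_probs) > threshold


# Hypothetical page with four images, one of which strongly matches the class.
probs = [0.05, 0.10, 0.92, 0.08]
print(classify_page_by_max(probs))   # max rule reacts to the single strong image
print(classify_page_by_mean(probs))  # mean rule is diluted by the other images
```

In this example, the maximum rule flags a page as soon as one image is a confident match, whereas averaging lets unrelated images (logos, banners) mask it.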
