Website Classification Using Word-Based Multiple N-Gram Models and Random Search Oriented Feature Parameters

Website classification is a convenient starting point for building intelligent web browsers and social networking sites that can learn a user's preferred categories and reliably detect adult or harmful websites. Classifying websites using only the information in the Uniform Resource Locator (URL) is an important and fast technique, but URL classification must be highly accurate to be usable in real-world applications. We therefore propose an improved approach to URL classification. We introduce word-based multiple n-gram models for efficient feature extraction and use a Naive Bayes classifier with a multinomial distribution inside a Random Search pipeline for hyperparameter optimization, which finds the best parameters of the URL features. We compare our experimental results with those of previous work and show an improvement over the existing results. Our approach achieves 88.77% recall and an F1-score of 87.63%, which is the best performance so far.
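The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization regex, the toy URLs and labels, the sampled parameter ranges, and the use of accuracy as the selection score (instead of recall or F1) are all assumptions made for illustration. It extracts word n-grams of several orders from a URL, fits a multinomial Naive Bayes model with additive smoothing, and uses random search to choose the n-gram range and the smoothing value.

```python
import math
import random
import re
from collections import Counter, defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", url.lower())


def word_ngrams(tokens, n_min, n_max):
    """Return word n-grams for every order n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class MultinomialNB:
    """Multinomial Naive Bayes with additive smoothing (alpha > 0)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for grams, label in zip(docs, labels):
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, grams):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for g in grams:
                if g in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, n_range, data):
    """Fraction of (url, label) pairs the model classifies correctly."""
    hits = sum(
        model.predict(word_ngrams(url_tokens(u), *n_range)) == y for u, y in data
    )
    return hits / len(data)


def random_search(train, valid, n_iter=20, seed=0):
    """Sample (n_min, n_max, alpha) at random; keep the best on `valid`."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        n_min = rng.randint(1, 2)        # assumed search range
        n_max = rng.randint(n_min, 3)    # assumed search range
        alpha = rng.uniform(0.01, 1.0)   # assumed search range
        docs = [word_ngrams(url_tokens(u), n_min, n_max) for u, _ in train]
        model = MultinomialNB(alpha).fit(docs, [y for _, y in train])
        score = accuracy(model, (n_min, n_max), valid)
        if best is None or score > best[0]:
            best = (score, (n_min, n_max), alpha, model)
    return best


# Hypothetical toy data, for illustration only.
TRAIN = [
    ("http://espn.example/football/scores", "sports"),
    ("http://sports.example/football/league", "sports"),
    ("http://dev.example/python/tutorial", "tech"),
    ("http://code.example/python/compiler", "tech"),
]
VALID = [
    ("http://news.example/football/match", "sports"),
    ("http://blog.example/python/tips", "tech"),
]

if __name__ == "__main__":
    score, n_range, alpha, _ = random_search(TRAIN, VALID)
    print("validation accuracy:", score, "n-gram range:", n_range, "alpha:", alpha)
```

Combining several n-gram orders lets the classifier use both single URL words and short word sequences, while random search avoids an exhaustive grid over the feature parameters.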
