A Naive Bayes approach for URL classification with supervised feature selection and rejection framework

Web page classification has become a challenging task due to the exponential growth of the World Wide Web. Uniform Resource Locator (URL)-based web page classification systems play an important role, but high accuracy may not be achievable because a URL contains minimal information. Nevertheless, a URL-based classifier combined with a rejection framework can serve as a first-level filter in a multistage classifier, with costlier feature extraction from page contents deferred to later stages. However, the noisy and irrelevant features present in URLs call for feature selection. We therefore propose a supervised feature selection method that identifies relevant URL features using statistical methods. We also propose a new feature weighting method for a Naive Bayes classifier that embeds the term goodness obtained from the feature selection method, and a rejection framework for the Naive Bayes classifier that uses the posterior probability as the confidence score. The proposed method is evaluated on the Open Directory Project and WebKB data sets. Experimental results show that our method can be an effective first-level filter, and McNemar tests confirm that our approach significantly improves performance.
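As a rough illustration of the first-level filter idea, the sketch below trains a multinomial Naive Bayes model on URL tokens and rejects a URL when the largest posterior probability falls below a threshold. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the tokenizer, the Laplace smoothing, the 0.8 threshold, the toy URLs, and the optional `weights` dictionary (a stand-in for term-goodness scores, applied here as exponents on the token likelihoods) are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

STOP_TOKENS = {"http", "https", "www"}  # assumed stop list, not from the paper


def tokenize(url):
    """Split a URL into lowercase alphabetic tokens (an assumed tokenization)."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [t for t in parts if len(t) > 2 and t not in STOP_TOKENS]


def train(samples):
    """Count classes and per-class tokens from (url, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for url, label in samples:
        class_counts[label] += 1
        for tok in tokenize(url):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab


def posteriors(url, model, weights=None):
    """Return normalized class posteriors for a URL.

    `weights` is a hypothetical token -> weight mapping standing in for
    term-goodness scores; each weight scales that token's log-likelihood.
    """
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokenize(url):
            if tok not in vocab:
                continue  # unseen tokens carry no evidence in this sketch
            w = 1.0 if weights is None else weights.get(tok, 1.0)
            # Laplace-smoothed token likelihood, scaled by the token weight
            score += w * math.log((token_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalize in log space to avoid underflow
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


def classify_with_reject(url, model, threshold=0.8, weights=None):
    """Return (label, confidence); label is None when the URL is rejected."""
    post = posteriors(url, model, weights)
    label, conf = max(post.items(), key=lambda kv: kv[1])
    return (label if conf >= threshold else None), conf


# Toy training data, invented purely for illustration
samples = [
    ("http://www.sportsnews.com/football/scores", "sports"),
    ("http://espn.com/football/league", "sports"),
    ("http://health.org/diet/nutrition", "health"),
    ("http://medline.gov/health/diet", "health"),
]
model = train(samples)
accepted = classify_with_reject("http://example.com/football", model)
rejected = classify_with_reject("http://example.net/index", model)
```

In a multistage setting, a URL for which `classify_with_reject` returns `None` would be passed on to a later, content-based stage, while accepted URLs keep the label assigned here. Raising the threshold trades coverage for accuracy at the first stage.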
