Web Page Classification Using MDAWkNN

Web page classification (WPC) is the process of assigning labels (categories) to Web pages based on their content, for example news, sport, education or entertainment. It is more challenging than plain text classification because Web pages carry dynamic and heterogeneous content, ranging from text to Flash, video and images. WPC can broadly be subdivided into two kinds: functional classification and subject classification. Functional classification classifies a Web site by the role it plays, while subject classification deals with the actual content of the Web page. Applications of Web page classification include, but are not restricted to, constructing and expanding Web directories, improving the quality of search results, assisted Web browsing (suggesting similar content on sites such as YouTube), knowledge base construction and Web content filtering (blocking illegal content).

One technique for classifying Web pages is to apply traditional machine learning methods to the content of the page: decision tree based methods such as J48 (Mark Hall, 2009), probabilistic methods such as Naïve Bayes (NB), instance based methods such as k Nearest Neighbor (kNN), and so on. Of these, the kNN algorithm is one of the simplest and easiest to use. It classifies an unknown sample using a distance measure, one of the most common being the Euclidean distance. Given a training set and a test sample, the distance between the test sample and each training sample is calculated. The k training samples closest to the test sample are then selected, and the test sample is assigned the class label held by the majority of these k nearest neighbors. kNN is among the most widely used algorithms and has several advantages, the most important being its ease of use. It is also robust with regard to the search space, meaning that classes need not be linearly separable.
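The traditional kNN procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the chapter's MDAWkNN method: the function names `euclidean` and `knn_predict`, the three-dimensional feature vectors and the class labels are all hypothetical, and ties are simply resolved in favor of the label encountered first among the nearest neighbors.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs. Every training
    sample is compared with the query, which is why plain kNN is costly
    on large, high-dimensional data sets.
    """
    # Sort all training samples by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    # Simple (unweighted) majority voting over the neighbor labels.
    # Assumption: on a tie, the label seen first (nearest) wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each vector could stand for term counts on a page.
train = [
    ((5.0, 0.0, 1.0), "sport"),
    ((4.0, 1.0, 0.0), "sport"),
    ((0.0, 5.0, 1.0), "news"),
    ((1.0, 4.0, 0.0), "news"),
    ((0.0, 1.0, 5.0), "education"),
]

print(knn_predict(train, (4.0, 0.0, 1.0), k=3))  # prints "sport"
```

With k = 3 the query's nearest neighbors are two "sport" pages and one "news" page, so the majority vote returns "sport".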
kNN can also be updated easily and has very few parameters. However, it has certain disadvantages too. It is computationally intensive, because the Euclidean distance must be calculated between the test sample and every training sample. There is no systematic approach to choosing the best value of k. Another problem is tie breaking, which arises when equal numbers of nearest neighbors belong to different classes. kNN is also sensitive to noisy attributes.

In this chapter, the traditional kNN algorithm is improved for Web page classification. Because thousands of features are used to induce a Web page classifier, the traditional kNN consumes more system resources and needs more induction time, as it must compare the test data with every training example. It also identifies the k nearest neighbors of the test data and applies simple majority voting to predict the class of the test data. In a data set with imbalanced class distribution, most of the k nearest neighbors may belong to

J. Alamelu Mangai
Birla Institute of Technology and Science Pilani, Dubai

[1]  Chih-Ming Chen,et al.  Two novel feature selection approaches for web page classification , 2009, Expert Syst. Appl..

[2]  S. Appavu alias Balamurugan,et al.  A novel feature selection framework for automatic web page classification , 2012, Int. J. Autom. Comput..

[3]  S. Appavu alias Balamurugan,et al.  Improving decision tree performance by exception handling , 2010, Int. J. Autom. Comput..

[4]  John Wang,et al.  Encyclopedia of Business Analytics and Optimization , 2018 .

[5]  H. Altay Güvenir,et al.  Classification by Voting Feature Intervals , 1997, ECML.

[6]  Zhong Ming,et al.  Text Learning and Hierarchical Feature Selection in Webpage Classification , 2008, ADMA.

[7]  Komal Kumar Bhatia,et al.  Domain Identification and Classification of Web Pages Using Artificial Neural Network , 2013 .

[8]  Saadat M. Alhashmi,et al.  Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework , 2010 .

[9]  Dianhong Wang,et al.  Survey of Improving K-Nearest-Neighbor for Classification , 2007, Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007).

[10]  Yong Yu,et al.  A Novel Web Page Categorization Algorithm Based on Block Propagation Using Query-Log Information , 2006, WAIM.

[11]  Sun Bo,et al.  A Study on Automatic Web Pages Categorization , 2009, 2009 IEEE International Advance Computing Conference.

[12]  Vladimír Bartík.  Text-Based Web Page Classification with Use of Visual Information , 2010, 2010 International Conference on Advances in Social Networks Analysis and Mining.

[13]  Ugur Ayan,et al.  Correlation between the Economy News and Stock Market in Turkey , 2013, Int. J. Bus. Intell. Res..

[14]  J. Michael Hardin,et al.  Credit Scoring in the Age of Big Data , 2014 .

[15]  Peiying Zhang,et al.  The Effective Classification of the Chinese Web Pages Based on KNN , 2010 .

[16]  Takashi Washio,et al.  Automatic Web-Page Classification by Using Machine Learning Methods , 2001, Web Intelligence.

[17]  Maryam Mahmoudi,et al.  A Persian Web Page Classifier Applying a Combination of Content-Based and Context-Based Features , 2009 .

[18]  Wei-Ying Ma,et al.  Web-page classification through summarization , 2004, SIGIR '04.

[19]  Yaquan Xu,et al.  A new feature selection method based on support vector machines for text categorisation , 2011, Int. J. Data Anal. Tech. Strateg..

[20]  Ali Selamat,et al.  Web page feature selection and classification using neural networks , 2004, Inf. Sci..

[21]  Roger G. Stone,et al.  Naive Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages , 2009 .

[22]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[23]  Ron Kohavi,et al.  The Power of Decision Tables , 1995, ECML.