Implicit Links-Based Techniques to Enrich K-Nearest Neighbors and Naive Bayes Algorithms for Web Page Classification

The web has become one of the most important data sources and now serves as a broad knowledge base for almost every field. Its content grows quickly, and its size increases every day. Because of this volume of data, web page classification has become crucial: users have difficulty finding what they are looking for, even when they use search engines. Web page classification is the process of assigning a web page to one or more classes based on previously seen labeled examples. Web pages carry many contextual features that can be used to improve classification accuracy. In this paper, we present a similarity computation technique based on implicit links extracted from the query log and use it with K-Nearest Neighbors (KNN) for web page classification. We also introduce an implicit links-based probability computation method for use with Naive Bayes (NB). The newly computed similarity and probability enrich KNN and NB, respectively. Experiments are conducted on two subsets of the Open Directory Project (ODP). Results show that (1) when used as the similarity measure for KNN, the implicit links-based similarity helps improve results, and (2) the implicit links-based probability helps improve on the results obtained by NB using only text-based probability.
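
To make the KNN side of the idea concrete, here is a minimal sketch, not the paper's actual formulation. It assumes that two pages are implicitly linked when users clicked both of them for the same query, and it assumes a Jaccard overlap of query sets as the similarity. The abstract does not specify either choice, and the function names, example URLs, and class labels are all hypothetical.

```python
from collections import Counter, defaultdict


def queries_per_page(query_log):
    """Map each page to the set of queries for which it was clicked."""
    page_queries = defaultdict(set)
    for query, page in query_log:
        page_queries[page].add(query)
    return page_queries


def implicit_link_similarity(page_a, page_b, page_queries):
    """Assumed measure: Jaccard overlap of the two pages' query sets."""
    qa = page_queries.get(page_a, set())
    qb = page_queries.get(page_b, set())
    union = qa | qb
    return len(qa & qb) / len(union) if union else 0.0


def knn_classify(target, labeled_pages, page_queries, k=3):
    """Similarity-weighted vote among the k most similar labeled pages."""
    scored = sorted(
        (
            (implicit_link_similarity(target, page, page_queries), label)
            for page, label in labeled_pages.items()
        ),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, label in scored:
        votes[label] += similarity
    # No neighbor shares a query with the target: no decision.
    if not votes or max(votes.values()) == 0:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical query log: (query, clicked page) pairs.
query_log = [
    ("python tutorial", "a.example/py"),
    ("python tutorial", "b.example/learn"),
    ("learn python", "a.example/py"),
    ("learn python", "b.example/learn"),
    ("football scores", "c.example/sport"),
    ("football scores", "d.example/news"),
]
labeled = {"a.example/py": "Computers", "c.example/sport": "Sports"}
page_queries = queries_per_page(query_log)

print(knn_classify("b.example/learn", labeled, page_queries, k=2))
print(knn_classify("d.example/news", labeled, page_queries, k=2))
```

In this sketch a page with no shared queries gets no label, which is one reason an implicit links-based similarity would plausibly be used alongside a text-based one rather than on its own.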
