Web Browsing Using Machine Learning on Text Data

As the number of Web users grows, Web browsing is becoming more popular, especially for casual use in which the user has no precise query in mind. By observing the user's behavior while browsing, we build a model of promising hyperlinks and use it to highlight hyperlinks on the requested Web pages. To do so, we propose text-learning methods for high-dimensional problems (several tens of thousands of features) with a highly unbalanced class distribution (more than 90% of examples belong to the majority class). The experimental results reported for the user modeling problem are consistent with the results of extensive experiments on a related problem: modeling the content category of a Web document using hyperlinks to the document. The results show that, when modeling with a Naive Bayesian classifier, the choice of feature selection method is highly important. In our experiments on Personal WebWatcher data, the best performance is obtained when features are scored by Odds Ratio and only a small number of the best features are used for learning.
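To make the feature selection step concrete, the following is a minimal sketch of Odds Ratio scoring over a two-class text problem. It is not the authors' implementation. It assumes the commonly used log form of the measure, OddsRatio(w) = log[P(w|pos)(1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg))], with probabilities estimated from document frequencies. The function names, the `smoothing` parameter, and the add-one smoothing itself are illustrative assumptions, added so that the logarithm stays defined for words that occur in only one class.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, smoothing=1.0):
    """Score each word by its log odds ratio for the positive class.

    docs      -- list of token lists, one per document (or hyperlink text)
    labels    -- list of booleans, True for the positive (minority) class
    smoothing -- assumed additive smoothing; keeps probabilities in (0, 1)
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos

    # Document frequency of each word, counted separately per class.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y else neg_df).update(set(tokens))

    scores = {}
    for w in set(pos_df) | set(neg_df):
        p_pos = (pos_df[w] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[w] + smoothing) / (n_neg + 2 * smoothing)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_features(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]
```

Words that are characteristic of the positive class receive high scores, so keeping only the top `k` words yields a small vocabulary focused on the minority class. The selected words would then serve as the feature set for the Naive Bayesian classifier.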
