Machine Learning Approach for Homepage Finding Task

This paper describes new machine learning approaches for predicting the correct homepage in response to a user's homepage finding query. The method has two phases. In the first phase, a decision tree is generated to predict whether a URL is a homepage URL. The decision tree is then used to filter out non-homepages from the web pages returned by a standard vector space information retrieval system. In the second phase, logistic regression analysis combines multiple sources of evidence on the homepages that remain after the first phase to predict which homepage is most relevant to the user's query. We use 100 queries to train the logistic regression model and another 145 queries to test it. For about 84% of the testing queries, the correct homepage was returned within the top 10 pages, compared with only 59% when no machine learning approach was applied, which indicates that the approaches are effective.
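The two-phase pipeline can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say which URL attributes the decision tree tests or which evidence sources feed the regression, so the rules in `looks_like_homepage`, the feature names (`content_sim`, `anchor_sim`, `url_depth`), and the weights are all hypothetical stand-ins. In practice both models would be learned from training data rather than written by hand.

```python
import math
from urllib.parse import urlparse

# Hypothetical file names often used for homepages (an assumption, not from the paper).
HOMEPAGE_FILES = {"index", "home", "default", "welcome"}


def looks_like_homepage(url):
    """Phase 1 stand-in: hand-written rules playing the role of a learned decision tree."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return True  # root of a site, e.g. http://example.org/
    last = segments[-1].lower()
    if last.split(".")[0] in HOMEPAGE_FILES:
        return len(segments) <= 3  # e.g. /lab/index.html
    # A short directory-style path such as /lab/ may be a sub-site homepage.
    return path.endswith("/") and len(segments) <= 2


# Hypothetical logistic regression coefficients; real ones would be fit on training queries.
WEIGHTS = {"bias": -2.0, "content_sim": 3.0, "anchor_sim": 2.5, "url_depth": -0.5}


def relevance_probability(features):
    """Phase 2: combine several evidence scores into one probability with a logistic function."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_homepages(candidates, top_k=10):
    """Drop predicted non-homepages, then rank the remainder by estimated relevance."""
    kept = [c for c in candidates if looks_like_homepage(c["url"])]
    kept.sort(key=lambda c: relevance_probability(c["features"]), reverse=True)
    return [c["url"] for c in kept[:top_k]]
```

Here each candidate is assumed to be a dictionary holding a `url` and a `features` mapping produced by an upstream retrieval system. The filter runs first so that the regression only has to discriminate among plausible homepages, which mirrors the ordering of the two phases described above.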
