Web-Document Retrieval by Genetic Learning of Importance Factors for HTML Tags

Unlike conventional documents, a Web document contains a number of tags that provide hints about its structure. In this paper, we propose a Web-document retrieval method that exploits the characteristics of HTML tags. The method learns the importance of each tag from a training text set, using a genetic algorithm to learn the importance weights. We also present a modified similarity measure that uses the tag information. Experiments were performed on the TREC document collection consisting of 247,491 documents. Compared to the traditional IR method, the proposed method achieved a 15% improvement in average precision.
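
The idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact similarity formula or the genetic operators, so the tag set `TAGS`, the tag-weighted term-frequency cosine similarity, the truncation selection, the one-point crossover, and the Gaussian mutation below are all assumptions chosen for illustration. Each chromosome is a vector of tag weights, and its fitness is the mean average precision over a set of training queries.

```python
import random
from collections import Counter
from math import sqrt

# Hypothetical tag set; the paper's actual tag list is not given in the abstract.
TAGS = ["title", "h1", "b", "a", "body"]

def weighted_tf(doc, weights):
    """doc maps tag -> list of terms; each occurrence counts as its tag's weight."""
    tf = Counter()
    for tag, terms in doc.items():
        w = weights.get(tag, 1.0)
        for term in terms:
            tf[term] += w
    return tf

def similarity(query_terms, doc, weights):
    """Cosine similarity between the query and the tag-weighted document vector (assumed form)."""
    d = weighted_tf(doc, weights)
    q = Counter(query_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def average_precision(ranked_ids, relevant):
    """Non-interpolated average precision of a ranked list of document ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def fitness(chromosome, queries, docs):
    """Mean average precision over training queries for one tag-weight vector."""
    weights = dict(zip(TAGS, chromosome))
    scores = []
    for query_terms, relevant in queries:
        ranked = sorted(docs, key=lambda i: similarity(query_terms, docs[i], weights),
                        reverse=True)
        scores.append(average_precision(ranked, relevant))
    return sum(scores) / len(scores)

def evolve(queries, docs, pop_size=20, generations=30, seed=0):
    """Simple GA: keep the better half, refill with crossover plus mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 5) for _ in TAGS] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: fitness(c, queries, docs), reverse=True)
        elite = ranked[:pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, len(TAGS))        # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(TAGS))             # mutate one gene, keep it non-negative
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.5))
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda c: fitness(c, queries, docs))
    return dict(zip(TAGS, best))
```

Here `docs` is assumed to map a document id to a dictionary of tag name to term list, and `queries` is a list of `(query_terms, relevant_ids)` pairs. Calling `evolve(queries, docs)` returns one learned weight per tag, which can then be passed to `similarity` when ranking unseen documents.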
