Evolutionary learning of Web-document structure for information retrieval

Web documents contain tags that indicate document structure, and this tag information can be used to improve the performance of document retrieval systems. We propose an approach that retrieves Web documents using HTML tags, with a similarity measure modified to incorporate per-tag weights. A genetic algorithm selects the tags used for retrieval and learns their weights. We evaluate the method in experiments on conference pages and TREC document sets. The results show that the proposed algorithm trains the tag weights in accordance with each tag's importance for retrieval, and the method achieves about a 10% improvement in retrieval accuracy.
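The idea of a tag-weighted similarity measure tuned by a genetic algorithm can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tag set `TAGS`, the cosine form of the similarity, the mean-reciprocal-rank fitness, the uniform crossover and Gaussian mutation operators, and the toy documents and queries. A weight of zero is treated here as deselecting a tag.

```python
import math
import random

TAGS = ["title", "h1", "b", "body"]  # hypothetical tag set, not from the paper


def weighted_tf(doc, weights):
    """Merge per-tag term counts into one vector, scaling each tag by its weight."""
    vec = {}
    for tag, counts in doc.items():
        w = weights.get(tag, 0.0)
        for term, n in counts.items():
            vec[term] = vec.get(term, 0.0) + w * n
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs, weights):
    """Return document indices ordered by decreasing similarity to the query."""
    q = {t: 1.0 for t in query}
    scores = [(cosine(q, weighted_tf(d, weights)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


def fitness(weights, docs, queries):
    """Mean reciprocal rank of the single relevant document for each query."""
    total = 0.0
    for query, relevant in queries:
        total += 1.0 / (rank(query, docs, weights).index(relevant) + 1)
    return total / len(queries)


def evolve(docs, queries, pop_size=20, generations=30, seed=0):
    """Toy genetic algorithm: keep the better half, refill by crossover and mutation."""
    rng = random.Random(seed)
    pop = [{t: rng.random() for t in TAGS} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, docs, queries), reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # Uniform crossover: each tag weight comes from one of the two parents.
            child = {t: rng.choice((p1[t], p2[t])) for t in TAGS}
            # Mutation: perturb one weight and clamp it to [0, 1].
            t = rng.choice(TAGS)
            child[t] = min(1.0, max(0.0, child[t] + rng.gauss(0.0, 0.1)))
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, docs, queries))


# Toy collection: the query term appears once in the title of the relevant
# document and three times in the body of the non-relevant one.
docs = [
    {"title": {"genetic": 1}, "body": {"retrieval": 3}},
    {"title": {"retrieval": 1}, "body": {"genetic": 3}},
]
queries = [(["genetic"], 0), (["retrieval"], 1)]

uniform = {t: 1.0 for t in TAGS}
best = evolve(docs, queries)
print("uniform fitness:", fitness(uniform, docs, queries))
print("evolved fitness:", fitness(best, docs, queries))
```

In this toy setting, uniform weights rank the wrong document first for both queries, while any weighting that favors `title` strongly enough over `body` ranks the relevant document first. Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.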
