Improving search results with data mining in a thematic search engine

The problem of obtaining relevant results in web searching has been tackled with several approaches. The most popular search engines use very effective techniques when no a priori knowledge of the user's needs is available besides the search keywords. In other settings, however, it is possible to design search methods that operate on a thematic database of web pages that refer to a common body of knowledge or to specific sets of users. Starting from these premises, we have designed and developed a search method that uses data mining and optimization techniques to return a more significant and more restricted set of pages as the final result of a user search. We adopt a vectorization method based on search context and user profile, and apply clustering techniques whose results are then refined by a specially designed genetic algorithm. In this paper we describe the method, its implementation, and the algorithms applied, and we discuss some experiments that have been run on test sets of web pages.
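The pipeline outlined above (vectorize pages, cluster them, refine the result with a genetic algorithm) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: pages are already given as small numeric vectors, clustering is a k-means-style loop using cosine similarity, the fitness function rewards similarity to a hypothetical user-profile vector plus the number of clusters covered, and the genetic operators (truncation selection, union-based crossover, single-swap mutation) are generic choices rather than the specially designed algorithm the abstract refers to.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, rng, iters=20):
    """k-means-style clustering (assumed here); returns one label per vector."""
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to the most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fitness(subset, vectors, labels, profile):
    """Hypothetical fitness: profile relevance plus number of clusters covered."""
    relevance = sum(cosine(vectors[i], profile) for i in subset)
    coverage = len({labels[i] for i in subset})
    return relevance + coverage

def refine(vectors, labels, profile, size, rng, pop_size=30, generations=40):
    """Generic genetic algorithm that searches for a good subset of `size` pages."""
    n = len(vectors)

    def score(subset):
        return fitness(subset, vectors, labels, profile)

    population = [rng.sample(range(n), size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: pop_size // 2]  # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = list(dict.fromkeys(a + b))    # union of parents, no duplicates
            child = rng.sample(pool, size)       # crossover
            if rng.random() < 0.2:               # mutation: swap in an unused page
                outside = [i for i in range(n) if i not in child]
                if outside:
                    child[rng.randrange(size)] = rng.choice(outside)
            children.append(child)
        population = survivors + children
    return sorted(max(population, key=score))

# Toy data: six pages described by three made-up term weights.
rng = random.Random(0)
pages = [
    [1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
    [0.0, 1.0, 0.2], [0.1, 0.9, 0.1], [0.2, 0.8, 0.3],
]
profile = [1.0, 0.2, 0.0]  # hypothetical user-profile vector
labels = kmeans(pages, 2, rng)
chosen = refine(pages, labels, profile, size=3, rng=rng)
print("cluster labels:", labels)
print("selected pages:", chosen)
```

The sketch only shows how the three stages can feed into one another: the cluster labels become an input to the fitness function, so the genetic search can trade relevance against coverage of distinct clusters when choosing the restricted result set.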
