Improving Encarta Search Engine Performance by Mining User Logs

We propose a data-mining approach that produces generalized query patterns (with generalized keywords) from the raw user logs of the Microsoft Encarta search engine (http://encarta.msn.com). Those query patterns can act as cache of the search engine, improving its performance. The cache of the generalized query patterns is more advantageous than the cache of the most frequent user queries since our patterns are generalized, covering more queries and future queries even those not previously asked. Our method is unique since query patterns discovered reflect the actual dynamic usage and user feedbacks of the search engine, rather than the syntactic linkage structure of web pages (as Google does). Simulation shows that such generalized query patterns improve search engine’s overall speed considerably. The generalized query patterns, when viewed with a graphical user interface, are also helpful to web editors, who can easily discover topics in which users are mostly interested.

[1]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[2]  Jian Pei,et al.  Mining Access Patterns Efficiently from Web Logs , 2000, PAKDD.

[3]  Ryszard S. Michalski,et al.  A theory and methodology of inductive learning , 1993 .

[4]  William B. Frakes,et al.  Stemming Algorithms , 1992, Information Retrieval: Data Structures & Algorithms.

[5]  Gregory Piatetsky-Shapiro,et al.  Advances in Knowledge Discovery and Data Mining , 2004, Lecture Notes in Computer Science.

[6]  Jiawei Han,et al.  Data-Driven Discovery of Quantitative Rules in Relational Databases , 1993, IEEE Trans. Knowl. Data Eng..

[7]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[8]  Jiawei Han,et al.  Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs , 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries -ADL'98-.

[9]  C. V. Ramamoorthy,et al.  Knowledge and Data Engineering , 1989, IEEE Trans. Knowl. Data Eng..

[10]  Ricardo Baeza-Yates,et al.  Information Retrieval: Data Structures and Algorithms , 1992 .

[11]  Boris Chidlovskii,et al.  Semantic Cache Mechanism for Heterogeneous Web Querying , 1999, Comput. Networks.

[12]  Tom M. Mitchell,et al.  Learning to construct knowledge bases from the World Wide Web , 2000, Artif. Intell..

[13]  Jiawei Han,et al.  Mining knowledge at multiple concept levels , 1995, CIKM '95.

[14]  James Parker,et al.  on Knowledge and Data Engineering, , 1990 .

[15]  Wagner Meira,et al.  Rank-preserving two-level caching for scalable search engines , 2001, SIGIR '01.

[16]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[17]  Jiawei Han,et al.  Generalization and decision tree induction: efficient classification in data mining , 1997, Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications.

[18]  Fabio Crestani,et al.  Exploiting the Similarity of Non-Matching Terms at Retrieval Time , 2000, Information Retrieval.

[19]  Alan Gilchrist,et al.  Thesaurus construction: a practical manual , 1972 .

[20]  Dayne Freitag,et al.  A Machine Learning Architecture for Optimizing Web Search Engines , 1999 .

[21]  Ramakrishnan Srikant,et al.  Mining generalized association rules , 1995, Future Gener. Comput. Syst..

[22]  Nils J. Nilsson,et al.  Artificial Intelligence , 1974, IFIP Congress.

[23]  Neel Sundaresan,et al.  Mining the Web for relations , 2000, Comput. Networks.

[24]  Dekang Lin,et al.  Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity , 1997, ACL.

[25]  George A. Miller,et al.  Introduction to WordNet: An On-line Lexical Database , 1990 .

[26]  Padmini Srivasan,et al.  Thesaurus Construction , 1992, Information Retrieval: Data Structures & Algorithms.

[27]  Hwee Tou Ng,et al.  Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach , 1996, ACL.