A personalized search engine based on Web-snippet hierarchical clustering

We propose a (meta-)search engine, called SnakeT (SNippet Aggregation for Knowledge ExtracTion), which queries more than 18 commodity search engines and offers two complementary views of their returned results. One is the classical flat-ranked list; the other is a hierarchical organization of these results into folders that are created on-the-fly at query time and labeled with intelligible sentences capturing the themes of the results they contain. Users can browse this hierarchy with various goals: knowledge extraction, query refinement and personalization of search results. In this novel form of personalization, the user is asked to interact with the hierarchy by selecting the folders whose labels (themes) best fit her query needs. SnakeT then personalizes the original ranked list on-the-fly by filtering out the results that do not belong to the selected folders. Consequently, this form of personalization is carried out by the users themselves and is thus fully adaptive, privacy preserving, scalable and non-intrusive for the underlying search engines. We have extensively tested SnakeT and compared it against the best available Web-snippet clustering engines. SnakeT is efficient and effective, and shows that a mutual reinforcement relationship between ranking and Web-snippet clustering does exist. In fact, the better the ranking of the underlying search engines, the more relevant the results from which SnakeT distills the hierarchy of labeled folders, and hence the more useful this hierarchy is to the user. Vice versa, the more intelligible the folder hierarchy, the more effective the personalization offered by SnakeT on the ranking of the query results.

Copyright © 2007 John Wiley & Sons, Ltd. This work was done while the second author was a PhD student at the Dipartimento di Informatica, University of Pisa. The work contains the complete description and a full set of experiments on the software system SnakeT, which was partially published in the Proceedings of the 14th International World Wide Web Conference, Chiba, Japan, 2005.
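
The folder-based personalization described in the abstract (keep only the ranked results that belong to the folders the user selects) can be illustrated with a minimal sketch. This is not SnakeT's actual implementation: the `Folder` class, the `personalize` function, the result identifiers and the folder labels are all hypothetical names chosen for illustration, and the sketch assumes that selecting a folder also selects every result in its sub-folders.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    """Hypothetical labeled folder in the result hierarchy."""
    label: str
    result_ids: set = field(default_factory=set)   # results placed directly here
    children: list = field(default_factory=list)   # nested sub-folders

    def all_result_ids(self):
        """Collect the results of this folder and of all its descendants."""
        ids = set(self.result_ids)
        for child in self.children:
            ids |= child.all_result_ids()
        return ids


def personalize(ranked_ids, selected_folders):
    """Filter the flat ranked list, preserving the original ranking order.

    Assumption: a result survives if it belongs to at least one selected folder.
    """
    keep = set()
    for folder in selected_folders:
        keep |= folder.all_result_ids()
    return [rid for rid in ranked_ids if rid in keep]


# Toy data (invented): five ranked results split across two themes.
ranked = ["r1", "r2", "r3", "r4", "r5"]
animal = Folder("jaguar animal", {"r2", "r5"})
car = Folder("jaguar car", {"r1"}, [Folder("car dealers", {"r4"})])

personalized = personalize(ranked, [car])
print(personalized)
```

Because the filter runs entirely on the client side over the already retrieved list, the sketch stores no user profile and sends no extra requests to the underlying engines, which is consistent with the privacy-preserving and non-intrusive properties claimed above.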
