Query-sets: using implicit feedback and query patterns to organize web documents

In this paper we present a new document representation model based on implicit user feedback obtained from search engine queries. The main objective of this model is to achieve better results in unsupervised tasks, such as clustering and labeling, by incorporating usage data obtained from search engine queries. This type of model allows us to discover why users visit a given document. The terms used in queries can provide a better choice of features, from the user's point of view, for summarizing the Web pages that were clicked from these queries. In this work we extend and formalize, as the "query model", an existing but little-known idea for document representation called the "query view". Furthermore, we create a novel model based on "frequent query patterns", called the "query-set model". Our evaluation shows that both query-based models outperform the vector-space model when used for clustering and labeling documents in a website. In our experiments, the query-set model reduces the number of features needed to represent a set of documents by more than 90% and improves the quality of the results by over 90%. We believe this is because our model chooses better features and provides labels that more accurately match the user's expectations.
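To make the two representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes a click log of (query, clicked document) pairs, plain term counts as weights, and a simple enumeration of co-occurring term sets in place of a real frequent-pattern miner. The names `click_log`, `query_model`, `query_set_model`, `min_support`, and `max_size` are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical click log: each entry is (query text, clicked document id).
click_log = [
    ("computer science courses", "doc_a"),
    ("science courses", "doc_a"),
    ("computer science admission", "doc_b"),
    ("admission requirements", "doc_b"),
    ("computer science courses", "doc_a"),
]


def query_model(log):
    """Represent each document by the terms of the queries that led to a click on it.

    Assumption: the weight of a term is its raw count over those queries.
    """
    vectors = defaultdict(Counter)
    for query, doc in log:
        vectors[doc].update(query.split())
    return dict(vectors)


def query_set_model(log, min_support=2, max_size=2):
    """Represent each document by term sets that appear together in clicked queries.

    Assumption: a term set is "frequent" when at least `min_support` queries
    in the whole log contain it; sets of up to `max_size` terms are considered.
    """
    support = Counter()            # term set -> number of queries containing it
    per_doc = defaultdict(Counter)  # document -> term set -> count
    for query, doc in log:
        terms = sorted(set(query.split()))
        for size in range(1, max_size + 1):
            for itemset in combinations(terms, size):
                support[itemset] += 1
                per_doc[doc][itemset] += 1
    frequent = {s for s, n in support.items() if n >= min_support}
    # Keep only the frequent term sets as features of each document.
    return {
        doc: {s: n for s, n in counts.items() if s in frequent}
        for doc, counts in per_doc.items()
    }


if __name__ == "__main__":
    print(query_model(click_log))
    print(query_set_model(click_log))
```

In this toy log, rare term sets such as ("admission", "requirements") fall below the support threshold and are dropped, which illustrates how a pattern-based representation can use far fewer features than one built from every term.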
