Web mining techniques for query log analysis and expertise retrieval

With the rapid growth in the amount of information available online, rich Web data can now be obtained on the Internet, including over one trillion Web pages, millions of scientific publications, and records of various social interactions, such as question answering and query logs. Web mining techniques have emerged as an important research area to help Web users find the information they need. In general, Web users express their information needs as queries and expect to obtain the needed information from the Web data through Web mining techniques. To better understand what users want from a given query, it is essential to analyze the query logs. The returned information, in turn, may be Web pages, images, or other types of data. Beyond this traditional information, it is both interesting and important to identify relevant experts who can be consulted further about the query topic, a task known as expertise retrieval. The objective of this thesis is to establish automatic content analysis methods and scalable graph-based models for query log analysis and expertise retrieval. One important aspect of this thesis is therefore to develop a framework that combines content information and graph information for two purposes: 1) analyzing Web contents with graph structures, more specifically, mining query logs; and 2) identifying high-level information needs, such as expertise retrieval, behind the contents. For the first purpose, a novel entropy-biased framework is proposed for modeling bipartite graphs. It is applied to the click graph to obtain a better query representation by treating heterogeneous query-URL pairs differently and diminishing the effect of noisy links. When only the graph information is used, however, there are no constraints to ensure the final relevance of the score propagation on the graph. 
To tackle this problem, a general Co-HITS algorithm is developed to incorporate the bipartite graph with the content information from both sides as well as the constraints of relevance. Extensive evaluations on query log analysis demonstrate the effectiveness of the proposed models. For the second purpose, a weighted language model is proposed to aggregate the expertise of a candidate from the associated documents. The model not only considers the relevance of documents to a given query, but also incorporates important factors of the documents in the form of document priors. Moreover, an approach is presented to boost expertise retrieval by incorporating the content with other implicit link information through a graph-based re-ranking model. Furthermore, two community-aware strategies are developed and investigated to enhance expertise retrieval, motivated by the observation that communities can provide valuable insight and distinctive information. Experimental results on the expert finding task demonstrate that these methods improve on traditional expertise retrieval models.
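
The weighted language model described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes a unigram query-likelihood model with Jelinek-Mercer smoothing for document relevance, a binary document-candidate association, and a simple sum over documents. The names `query_likelihood`, `expert_scores`, `lam`, and the toy data are hypothetical.

```python
from collections import Counter


def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) under a unigram model with Jelinek-Mercer smoothing (assumed).

    `doc` and `collection` are non-empty lists of tokens.
    """
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 1.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)
        p_coll = coll_tf[term] / len(collection)
        score *= (1 - lam) * p_doc + lam * p_coll
    return score


def expert_scores(query, docs, authors, priors):
    """Sum prior-weighted document relevance over each candidate's documents."""
    collection = [tok for tokens in docs.values() for tok in tokens]
    scores = {}
    for doc_id, tokens in docs.items():
        weighted = priors[doc_id] * query_likelihood(query, tokens, collection)
        for candidate in authors[doc_id]:
            scores[candidate] = scores.get(candidate, 0.0) + weighted
    return scores


# Toy data (hypothetical): two documents, one author each.
docs = {
    "d1": ["graph", "ranking", "graph"],
    "d2": ["speech", "recognition"],
}
authors = {"d1": ["alice"], "d2": ["bob"]}
uniform_priors = {"d1": 1.0, "d2": 1.0}

scores = expert_scores(["graph"], docs, authors, uniform_priors)
print(sorted(scores, key=scores.get, reverse=True))
```

With uniform priors the candidate whose document matches the query ranks first; raising the prior of a document (for example, to reflect a hypothetical citation count) shifts weight toward its authors without changing the relevance model.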
