Addressing the challenges of underspecification in web search

The World Wide Web contains information on a scale far beyond the capacity of manual organization methods. Web search engines help users sift through that information to find data of interest through keyword searches, while also driving a multi-billion dollar advertising industry on the Web. Searching through all of the data on the Web to find the most relevant content is an enormous task, often exacerbated by underspecification and ambiguity in the queries posed by users or in the underlying data itself. Users frequently omit relevant context or submit multifaceted queries, authors rarely provide explicit keywords or categorizations, and content is often missing relevant keywords. This uncertainty makes it inherently difficult for search engines to find the best information for a particular user and query. We investigate these problems and propose techniques to effectively satisfy the needs of users and advertisers when a search engine encounters such uncertainty. We address three main challenges: (1) Discovering which queries or keywords may benefit from contextualization. We propose a framework for automatically identifying geo-localizable queries, establishing several features measurable from search query logs that enable traditional machine learning algorithms to classify queries with high accuracy. (2) Given an ambiguous query, determining the most likely user requirements for each of the possible subtopics and then selecting a diverse set of pages to satisfy the greatest number of users. We describe a model for user satisfaction with a returned set of pages and propose a greedy algorithm for diversifying search results tailored to the requirements of informational queries, where users frequently require more than one relevant result. We demonstrate notable improvement over current ranking strategies. (3) Identifying the pertinent keywords from sparse or imprecise content. We study two approaches for generating keywords from the text content of videos and investigate related term mining approaches to overcome potential mismatches between these keywords and the keywords chosen by searchers or advertisers. We perform extensive evaluations to highlight the conditions under which each method generates the most relevant keywords. This dissertation presents and evaluates methods and algorithms that may benefit search engines, their users, and their advertising partners for a significant fraction of search instances and exabytes of data.
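
The greedy diversification idea in challenge (2) can be illustrated with a minimal sketch. This is not the dissertation's actual model or algorithm: the satisfaction function (partial credit until a subtopic has received `need` relevant pages), the binary page-to-subtopic labels, and all names (`greedy_diversify`, `subtopic_weight`, `need`) are assumptions made here for illustration only.

```python
from typing import Dict, List, Set


def greedy_diversify(
    pages: Dict[str, Set[str]],        # page id -> subtopics it is relevant to (assumed known)
    subtopic_weight: Dict[str, float], # subtopic -> estimated share of users wanting it
    k: int,                            # number of results to return
    need: int = 2,                     # assumed number of relevant pages each user wants
) -> List[str]:
    """Pick k pages greedily by marginal gain in an assumed satisfaction score.

    Assumed score: sum over subtopics of weight * min(covered, need) / need,
    so a subtopic keeps earning credit until `need` of its pages are selected.
    """
    selected: List[str] = []
    covered = {t: 0 for t in subtopic_weight}
    remaining = list(pages)

    def gain(page: str) -> float:
        # Marginal gain: each still-unsatisfied subtopic contributes weight / need.
        return sum(
            subtopic_weight[t] / need
            for t in pages[page]
            if t in covered and covered[t] < need
        )

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)  # ties go to the earlier page
        selected.append(best)
        remaining.remove(best)
        for t in pages[best]:
            if t in covered:
                covered[t] += 1
    return selected


# Hypothetical ambiguous query "jaguar" with two subtopics.
pages = {
    "d1": {"animal"},
    "d2": {"animal"},
    "d3": {"animal"},
    "d4": {"car"},
    "d5": {"car"},
}
weights = {"animal": 0.6, "car": 0.4}

print(greedy_diversify(pages, weights, k=4, need=2))
```

With `need=2` the sketch returns two pages per subtopic before spending a slot on a third page for the more popular one, which reflects the abstract's point that informational queries often require more than one relevant result. Setting `need=1` reduces it to the familiar case in which one page per subtopic is enough.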
