Augmenting Search with Corpus-derived Semantic Relevance

This paper describes a system for contextually steered web search. The system is based on a method for estimating the semantic relevance of a web page to a query. Consider doing a web search for conferences about web search. The query "search conferences" is not effective, as it produces results relevant for the most part to searching over conferences, rather than to conferences on the topic of search. The system described in this paper enables queries of the form "search conference context:pagerank". The context field in this example specifies a preference for results semantically relevant to the term "pagerank", although there is no requirement that those results contain the word "pagerank" itself. This is a more semantic, less lexical way of refining the query than adding literal conjuncts. Contextual search, as implemented in this paper, is based on the Google search engine (Google). For each query, the top one hundred search results are fetched from Google and sorted according to their relevance to the context query. Relevance is computed as a distance function between the vocabulary vectors associated with a web page and a query. For queries, the vocabulary vector is formed by aggregating the web pages in the search results for that query. For web pages, the vocabulary vector is aggregated from that web page and other web pages nearby in link-space.

1 SEMANTIC MODELS OF TEXT

The Internet has a tremendous amount of information, much of which is encoded in natural language. Human natural language is innately highly polysemous at both the word and the phrasal level, so texts are rife with ambiguity. This is a problem for purely lexical search engines. One can refine an ambiguous query by successively adding qualifiers, but this can be time consuming, and the variety of ways a given idea can be expressed can make the addition of query conjuncts dangerously restrictive.

For contextual search we need a way to computationally model the semantics of short texts: queries are usually no more than a few words, and the amount of text on a web page can be as low as zero. What is needed is an approach that supports quick computations and requires no background knowledge. In the approach described in this paper, the semantic representation need only support a similarity operator (it is not necessary, for instance, that propositional information be extractable from it). Further requirements are that representations should be compact, should be noise tolerant, and should permit the comparison of arbitrary texts.

Our solution is to use vectors of associated vocabulary to model the semantics of queries and web pages. For a query, we obtain a vocabulary vector by doing a web search on that query (on Google, for instance), taking all the snippets associated with each of the top 100 search results and breaking them into a bag-of-words representation. A more thorough approach is to fetch the N result links returned by the web search, follow them, and amalgamate their text. The disadvantage of this approach is the time required: web pages may be served slowly, in practice taking on the order of seconds to load, and in any event this approach is bandwidth intensive. Empirically, we find that the expanded representation obtained from using whole web pages rather than snippets does not improve performance (probably because with snippets performance is already very high).
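As a concrete illustration of the snippet-based representation, the following minimal Python sketch aggregates a list of result-snippet strings into a bag-of-words count vector. The tokenizer, the stop-word list, and the toy snippets are illustrative assumptions for the purpose of the example, not details fixed by the method described above.

    # Minimal sketch: build a query's vocabulary vector from result snippets.
    # The stop-word list and tokenizer here are illustrative assumptions.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}

    def vocab_vector(snippets):
        # `snippets` is a list of snippet strings, e.g. the snippets attached
        # to the top 100 search results for a query.
        counts = Counter()
        for text in snippets:
            for token in re.findall(r"[a-z]+", text.lower()):
                if token not in STOP_WORDS:
                    counts[token] += 1
        return counts

    # Toy usage with two invented snippets for the query "java":
    print(vocab_vector([
        "Java software tutorials and programming articles",
        "Java development: applets, tools and tutorials for developers",
    ]).most_common(5))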
After filtering out stop-words, the average number of distinct symbols in the snippet-based representation of a keyword built from 100 snippets is, over a sample of ten thousand representations, 710. Table 1 shows the top fifteen symbols in the representation for "java", which has 512 distinct symbols and a total of 811 symbols.

Table 1: Most frequent symbols from the vocabulary vector of the query "java".

    count  symbol
       15  software
       14  tutorials
       12  technology
       11  programming
       11  development
        9  applets
        7  articles
        6  project
        6  enterprise
        6  edition
        6  developers
        6  comprehensive
        6  books
        5  virtual
        5  training

To get the vocabulary vector for a web page, we start by taking the text in the web page and breaking it up into a bag-of-words. Unfortunately, many web pages have relatively little text. They might be succinct, or they might be stubs, or they might be nexuses linking to content but offering little direct content themselves. Low vocabulary counts are, with this classification method, likely to lead to poor accuracy. We solve this problem and expand the vocabulary associated with a web page by recursively downloading the pages to which the base result page links, up to a given maximum depth (in this case, 3), provided that the links are on the same host as the original link. The vocabulary vector for each page so spidered is normalized so that its magnitude is constant. Also, each page is assigned a weight equal to 1/2^n, where n is its distance in links from the root page. Finally, since obtaining the HTML for web pages is relatively costly (taking up to a few seconds per page), we limit the number of pages required by setting a maximum depth and, for web pages having more than ten links, choosing ten links at random. In practice, this produces a characteristic vocabulary vector with on the order of four thousand distinct terms (after stop words and extraneous matter such as JavaScript code have been discarded), which provides sufficient contextual discernment for our purposes.

It is easy to imagine this approach to modelling the semantics of web pages failing: web pages often link to pages that are only peripherally relevant, or contain text that is digressive or irrelevant. Nevertheless, empirically (see below) this method works well. One of the queries discussed below is "gibson context:neuromancer"; one of the most relevant result pages for this query is http://www.williamgibsonbooks.com/, part of whose characterization is shown in Table 2.

Table 2: Most significant unigrams for "William Gibson".

    count  symbol
     56.0  collector
     48.7  gibson
      8.7  william
      8.4  neuromancer
      8.2  book
      6.2  buy
      4.8  novel
      4.5  active
      4.0  wait
      4.0  request
      4.0  eve
      3.9  science
      3.7  fiction
      3.7  award
      3.6  recognition
      3.5  pattern

We compare semantic models using a Naive Bayes classifier. We approximated lexical prior probabilities by reference to the British National Corpus (Leech et al., 2001), which lists every word in a large, heterogeneous cross-section of English documents, along with its frequency. The score given in the tables below is the natural log probability of the normalized vocabulary vector of the web page being generated by the normalized vocabulary vector of the contextual query, divided by the number of symbols in the latter vector.
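For concreteness, the sketch below is one way the scoring rule just described might be realized in Python. How the BNC-derived lexical priors enter the model, the smoothing weight, and the reading of "number of symbols" as distinct rather than total symbols are assumptions made for the sake of the example, not details fixed above.

    # Sketch of the Naive-Bayes-style relevance score: the natural-log
    # probability of the page's vocabulary vector being generated by the
    # contextual query's vocabulary vector, divided by the number of
    # (here: distinct) symbols in the query vector.
    import math
    from collections import Counter

    def relevance_score(page_vec, query_vec, background, alpha=0.9):
        # `background` maps words to corpus frequencies (e.g. drawn from the
        # British National Corpus); interpolating with it is an assumed way
        # of smoothing words unseen in the query model.
        query_total = sum(query_vec.values())
        bg_total = sum(background.values())
        log_p = 0.0
        for word, count in page_vec.items():
            p_query = query_vec.get(word, 0) / query_total
            p_bg = (background.get(word, 0) + 1) / (bg_total + len(background))
            log_p += count * math.log(alpha * p_query + (1 - alpha) * p_bg)
        return log_p / len(query_vec)

    # Toy usage with invented counts:
    query_vec = Counter({"neuromancer": 8, "gibson": 5, "fiction": 4})
    page_vec = Counter({"gibson": 10, "novel": 3, "fiction": 2})
    background = Counter({"gibson": 50, "novel": 400, "fiction": 900, "neuromancer": 5})
    print(relevance_score(page_vec, query_vec, background))

Result pages fetched for the base query can then be re-ranked by this score against the context query's vector.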