Estimating Search Engine Ranking Function with Latent Semantic Analysis and a Genetic Algorithm

This study seeks to obtain an estimation function f for Google's ranking function and then to compare the ranks it recomputes with the actual ranks of the search results for a series of queries. We formulate the problem as curve fitting: constructing the mathematical function that best fits the search results for several queries. The proposed estimation function defines the score of a document as the weighted sum of scores from a limited set of factors: the document's title, snippet, and URL, together with its PageRank and MozRank. The set of terms semantically related to a given query, and their associated relevance scores, are obtained from latent semantic analysis of the search results retrieved for that query. The relative weights of the factors are determined by a genetic algorithm. Experimental results indicate that Kendall's Tau and R-Precision are highest when all factors are included. Further, PageRank and MozRank reinforce each other.
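
As an illustration of the formulation above, the following minimal sketch scores documents by a weighted sum over the five factors, reranks them, and searches for weights with a small genetic algorithm. It is a sketch under stated assumptions rather than the authors' implementation: the per-factor scores are assumed to be precomputed and normalized to [0, 1] (the latent semantic analysis step is not shown), Kendall's Tau against the actual ranking is assumed as the fitness function, and the operators (truncation selection, one-point crossover, Gaussian mutation) as well as all function and parameter names are hypothetical choices made for the example.

```python
import random
from itertools import combinations

# Factor names follow the abstract; the dictionary keys are hypothetical.
FACTORS = ["title", "snippet", "url", "pagerank", "mozrank"]

def score(doc, weights):
    """Weighted sum of per-factor scores (assumed normalized to [0, 1])."""
    return sum(w * doc[f] for f, w in zip(FACTORS, weights))

def rerank(docs, weights):
    """Return document ids ordered by descending estimated score."""
    ranked = sorted(docs, key=lambda d: score(d, weights), reverse=True)
    return [d["id"] for d in ranked]

def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_a = {d: i for i, d in enumerate(order_a)}
    pos_b = {d: i for i, d in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both orderings agree on its relative order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def fitness(weights, docs, actual_order):
    """Assumed fitness: agreement between recomputed and actual ranks."""
    return kendall_tau(rerank(docs, weights), actual_order)

def evolve(docs, actual_order, pop_size=30, generations=40, seed=0):
    """Toy genetic algorithm over weight vectors (operators are assumptions)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in FACTORS] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the fitter half unchanged (elitism).
        pop.sort(key=lambda w: fitness(w, docs, actual_order), reverse=True)
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, len(FACTORS))   # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(FACTORS))        # mutate a single weight
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, docs, actual_order))
```

Given a list of documents such as `{"id": "d1", "title": 0.8, "snippet": 0.6, "url": 0.2, "pagerank": 0.5, "mozrank": 0.4}` and the order in which the search engine actually returned them, `evolve(docs, actual_order)` returns a weight vector, and `fitness` reports how closely the recomputed ranking matches the actual one. Dropping a factor from `FACTORS` gives a simple way to mimic the kind of factor-ablation comparison the abstract describes.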