Weighting in Information Retrieval Using Genetic Programming: A Three Stage Process

This paper presents term-weighting schemes that have been evolved using genetic programming in an ad hoc Information Retrieval (IR) model. We create an entire term-weighting scheme by first assuming that term-weighting schemes contain a global part, a term-frequency influence part and a normalisation part. By separating the problem into three distinct phases, we reduce the search space and ease the analysis of the schemes generated by the process. Evolutionary computation techniques are proving to be a viable alternative to standard analytical methods in many areas of IR. Genetic Programming (GP) [2] is an automated search algorithm inspired by biological evolution. GP has been shown to be an effective approach to learning term-weighting schemes in IR [5]. Firstly, we evolve weighting schemes in a global domain, which promote the best terms to use in distinguishing documents. Then, using a suitable global scheme, we evolve term-frequency influence schemes, which use the within-document term-frequency to correctly weight the term-frequency factor. Finally, we evolve normalisation schemes based on the best performing combined global and term-frequency scheme. This framework is an extension of work carried out in [1]. Most term-weighting schemes combine these three aspects to weight query terms and thus score a document in relation to a query.
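As a rough illustration of the three-part decomposition described above, the following sketch scores a document against a query by combining a global part, a term-frequency influence part and a normalisation part. The three component functions are placeholder assumptions chosen only for readability (an idf-style global weight, a saturating term-frequency function and a square-root length normalisation); they are not the schemes evolved in this work, and the way the parts are combined is likewise an assumption. In the three-stage process, each placeholder would be the slot that GP fills in turn.

```python
import math
from collections import Counter


def global_weight(term, docs):
    """Placeholder global part (assumed): idf-style weight log(N / df)."""
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(n_docs / df) if df else 0.0


def tf_influence(tf):
    """Placeholder term-frequency influence part (assumed): saturating tf."""
    return tf / (tf + 1.0)


def normalisation(doc_len, avg_len):
    """Placeholder normalisation part (assumed): sqrt of relative length."""
    return math.sqrt(doc_len / avg_len)


def score(query, doc, docs):
    """Score one tokenised document against a tokenised query.

    Assumed combination: sum over matching query terms of
    global_weight(t) * tf_influence(tf / normalisation).
    """
    counts = Counter(doc)
    avg_len = sum(len(d) for d in docs) / len(docs)
    norm = normalisation(len(doc), avg_len)
    total = 0.0
    for term in query:
        tf = counts[term]
        if tf == 0:
            continue  # query term absent from this document
        total += global_weight(term, docs) * tf_influence(tf / norm)
    return total


# Toy collection of tokenised documents (hypothetical data).
docs = [
    ["gp", "term", "weight"],
    ["term", "term", "frequency"],
    ["document", "length"],
]
query = ["gp", "term"]
ranking = sorted(range(len(docs)), key=lambda i: score(query, docs[i], docs), reverse=True)
```

Keeping the three parts as separate functions mirrors the staged search: one part can be swapped for an evolved candidate while the other two stay fixed, which is what keeps each stage's search space small.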