DF or IDF? On the Use of Primary Feature Model for Web Information Retrieval

In Web information retrieval (IR), input queries are often too short and fuzzy to describe the user's request, which leads to a mismatch between the query and documents that are full of redundancy and noise. This paper first studies the characteristics of Web document information and proposes the concepts of primary feature word, primary feature field and primary feature space (PFS). A new PFS query term weighting scheme is then proposed, which takes document frequency (DF) into account instead of the traditional inverse document frequency (IDF) factor. Finally, a strategy for combining term weights is given. Using this PFS model, three groups of experiments were performed on 10G and 19G large-scale Web collections with the standard TREC9, TREC10 and TREC11 Web track tests. Comparative studies indicate that the new DF-related PFS term weighting improves system performance consistently and effectively in terms of recall, top-n precision and mean average precision. The largest improvement observed is 18.6%.
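To make the DF versus IDF contrast concrete, the following is a minimal sketch only. The abstract does not state the PFS weighting formula, so both functions and their names (`idf_weight`, `df_weight`) are illustrative assumptions: the first is the textbook IDF factor, the second is a hypothetical normalized DF factor that simply grows with document frequency.

```python
import math


def idf_weight(df: int, n_docs: int) -> float:
    """Textbook IDF: terms that occur in fewer documents get a larger weight."""
    return math.log(n_docs / df)


def df_weight(df: int, n_docs: int) -> float:
    """Hypothetical DF-based factor (an assumption, not the paper's formula):
    terms that occur in more documents get a larger weight, scaled to (0, 1]."""
    return math.log(1 + df) / math.log(1 + n_docs)


if __name__ == "__main__":
    n_docs = 1000  # assumed size of the collection (or of a primary feature space)
    for term, df in [("rare", 10), ("common", 100)]:
        print(term, round(idf_weight(df, n_docs), 3), round(df_weight(df, n_docs), 3))
```

Under these assumptions the two factors rank terms in opposite order: IDF favors the rarer term, while the DF-based factor favors the term that occurs in more documents.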