Finding Rare Web Pages by Relevancy and Atypicality in a Category

In this paper, we propose the rarity of a Web page within a user-specified category as a means of finding useful information that only a few people know. A rare Web page is a page that belongs to the given category and is atypical within it. We define the rarity score as the probability that a page is a rare Web page in the given category. The rarity score is the product of a relevancy score and an atypicality score. The relevancy is the probability that a Web page belongs to the category given by the user. The atypicality is the conditional probability that a page is atypical in the category, given that it belongs to the category. Both probabilities are calculated using tags from social bookmarking services and words in the Web pages. We evaluated the proposed relevancy score by classifying whether Web pages belong to a certain category. We also evaluated the proposed rarity score as a metric for ranking Web pages, and compared it with rankings by relevancy and by atypicality. The results confirmed that the rarity score is useful for finding pages that are both relevant and atypical.
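
The product definition of the rarity score can be illustrated with a minimal sketch. It assumes the two probabilities have already been estimated for each page; how they are derived from bookmark tags and page words is not shown here. The names `PageScores`, `rarity`, and `rank_by_rarity`, and all numeric values, are hypothetical and serve only as illustration.

```python
from typing import List, NamedTuple


class PageScores(NamedTuple):
    """Hypothetical container for the per-page probabilities."""
    url: str
    relevancy: float    # P(page belongs to the category)
    atypicality: float  # P(page is atypical | page belongs to the category)


def rarity(page: PageScores) -> float:
    """Rarity score: product of the relevancy and atypicality scores."""
    return page.relevancy * page.atypicality


def rank_by_rarity(pages: List[PageScores]) -> List[PageScores]:
    """Order pages from highest to lowest rarity score."""
    return sorted(pages, key=rarity, reverse=True)


# Illustrative values only, not results from the paper.
pages = [
    PageScores("page-a", relevancy=0.9, atypicality=0.1),  # relevant but typical
    PageScores("page-b", relevancy=0.8, atypicality=0.7),  # relevant and atypical
    PageScores("page-c", relevancy=0.2, atypicality=0.9),  # atypical but barely relevant
]

ranked = rank_by_rarity(pages)
for page in ranked:
    print(page.url, round(rarity(page), 2))
```

In this toy input, a page ranks first only when both scores are high: a page that is highly relevant but typical, or atypical but barely relevant, receives a low product.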