REDUCE: a semi-supervised scalable approach for REsult DUplication detection in Search Engines

Abstract. Search engines are among the most popular services on the World Wide Web. They help users find information through a query-result mechanism. However, the results they return often contain many duplicates. For instance, when a user searches for a query q on Google, they receive n links spread across m pages. Although the returned links are distinct, the same content still appears under several of them. This problem is referred to as duplication. Solving it would improve the quality of search results and reduce the time spent searching per query. In this paper, we introduce a new method called REDUCE (REsult DUplication detection in searCh Engines) to address this problem. REDUCE takes a semi-supervised approach: it approximately measures the similarity between web pages, and we propose a new method to group the search results based on that similarity. To evaluate our method, we collected data from Google and other search engine platforms. We show that the method can solve this problem on different search engine platforms and in different languages. We evaluated our results empirically with different classification algorithms and reached an accuracy of 96.7%.
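
To make the idea of measuring page similarity and grouping results concrete, the following is a minimal sketch and not the REDUCE method itself. It assumes word-shingle Jaccard similarity and a greedy threshold-based grouping; the function names, the shingle size k = 3, and the threshold of 0.8 are hypothetical choices made only for illustration.

```python
from typing import List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_results(pages: List[str], threshold: float = 0.8) -> List[List[int]]:
    """Greedily group page indices (assumed scheme, for illustration only).

    A page joins the first group whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new group.
    """
    sets = [shingles(p) for p in pages]
    groups: List[List[int]] = []
    for i, s in enumerate(sets):
        for g in groups:
            if jaccard(sets[g[0]], s) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups


pages = [
    "the quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog",
    "completely different text about search engines today",
]
print(group_results(pages))
```

In this sketch the first two pages differ only in letter case and end up in one group, while the third page forms its own group. Exact pairwise Jaccard is used here only for clarity; the abstract does not state which approximation REDUCE uses.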