Research on Web Spam Detection Based on Support Vector Machine

With the rapid development of the Internet, spam web pages, which are created to deceive search engines and raise their rankings in search results, have become prevalent. Web spam is a major problem for today's search engines, so search engines need to be able to detect it during crawling. Web spam detection is treated as a classification problem: a machine learning classification algorithm builds a model that, given a web page, assigns it to one of two categories, Normal or Spam. For the support vector machine classification model, a soft-margin classifier based on a linear support vector machine was trained on the sample set, and penalty functions were defined according to the links between web pages, since linked pages seem to have similar characteristics. Both the content features and the link structure between web pages were used to build the classifier.
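
As a rough illustration of the idea described above, the following minimal sketch trains a linear soft-margin SVM whose objective carries an extra penalty for linked pages that receive different scores. The abstract does not give the form of the penalty functions, so the squared score difference used here, the subgradient-descent training loop, and all names and parameters (train_linked_svm, predict, C, gamma, lr, epochs) are assumptions made for illustration only, not the authors' method.

```python
import numpy as np

def train_linked_svm(X, y, edges, C=1.0, gamma=0.1, lr=0.01, epochs=500):
    """Subgradient descent on an assumed objective:

        0.5 * ||w||^2
        + C * sum_i max(0, 1 - y_i * (w . x_i + b))      (soft-margin hinge loss)
        + gamma * sum_{(i, j) in edges} (f_i - f_j)^2    (assumed link penalty)

    where f_i = w . x_i + b, y_i is +1 (Normal) or -1 (Spam), and each
    edge (i, j) is a hyperlink between pages i and j.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    w = np.zeros(n_features)
    b = 0.0
    src = np.array([i for i, _ in edges], dtype=int)
    dst = np.array([j for _, j in edges], dtype=int)

    for _ in range(epochs):
        scores = X @ w + b
        # Pages inside the margin or misclassified contribute to the hinge term.
        violated = y * scores < 1
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        if len(edges) > 0:
            # The bias cancels in f_i - f_j, so only w receives this gradient.
            diff = scores[src] - scores[dst]
            grad_w += 2.0 * gamma * (diff[:, None] * (X[src] - X[dst])).sum(axis=0)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Return +1 (Normal) or -1 (Spam) for each row of X."""
    return np.where(np.asarray(X, dtype=float) @ w + b >= 0, 1, -1)
```

In this sketch the rows of X would hold content features of the pages and edges would hold the hyperlinks between them, so both kinds of information enter the classifier. Setting gamma to 0 reduces it to an ordinary linear soft-margin SVM.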