REPTREE CLASSIFIER FOR IDENTIFYING LINK SPAM IN WEB SEARCH ENGINES

Search engines are used to retrieve information from the web. Users usually attend only to the top 10 results, and sometimes only to the top 5, because of time constraints and their trust in the search engine; they assume that these results are the most relevant. This is where the problem of spamdexing arises: it is a method of degrading search result quality through deception. Manipulating ranking signals, for example by inserting an enormous number of keywords or links into a website, can push that website into the top 10 or top 5 positions. This paper proposes a classifier based on the Reptree (Regression tree representative). As an initial step, link-based features such as neighbors, PageRank, truncated PageRank, TrustRank and assortativity-related attributes are derived. A tree is then constructed from these features and uses them to differentiate spam sites from legitimate sites. The WEBSPAM-UK-2007 dataset is taken as the base; it is preprocessed and converted into five datasets: FEATA, FEATB, FEATC, FEATD and FEATE. Only link-based features are used in the experiments, since this paper focuses on link spam alone. Finally, a representative tree is created that classifies web spam entries more precisely. Experimental results are reported and indicate that regression tree classification performs well.
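As an illustration of the idea, the following is a minimal sketch of a tree grown on link-based features and then pruned against held-out hosts. The abstract does not say how the tree is grown or pruned, so the sketch assumes variance-reduction splits and reduced-error pruning, which is how Weka's REPTree learner is commonly described. The three features (in-degree, PageRank, TrustRank), all numeric values, and the function names are invented for illustration and are not taken from WEBSPAM-UK-2007.

```python
# Toy sketch only: feature values and labels are invented, not WEBSPAM-UK-2007 data.
# Each row is ((in_degree, pagerank, trustrank), label) with label 1 = spam, 0 = legitimate.

def mean(ys):
    return sum(ys) / len(ys)

def sse(ys):
    """Sum of squared errors around the mean (the impurity used for splitting)."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def build(rows, depth=0, max_depth=3, min_rows=2):
    """Grow a regression tree by choosing the split with the lowest total SSE."""
    ys = [y for _, y in rows]
    node = {"value": mean(ys)}
    if depth >= max_depth or len(rows) < 2 * min_rows or sse(ys) == 0:
        return node
    best = None
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] <= t]
            right = [r for r in rows if r[0][f] > t]
            if len(left) < min_rows or len(right) < min_rows:
                continue
            cost = sse([y for _, y in left]) + sse([y for _, y in right])
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    if best is None:
        return node
    _, f, t, left, right = best
    node.update(feature=f, threshold=t,
                left=build(left, depth + 1, max_depth, min_rows),
                right=build(right, depth + 1, max_depth, min_rows))
    return node

def predict(node, x):
    """Walk from the root to a leaf and return the leaf's mean label."""
    while "feature" in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]

def errors(node, rows):
    """Count misclassified rows, treating a predicted value >= 0.5 as spam."""
    return sum((predict(node, x) >= 0.5) != (y == 1) for x, y in rows)

def prune(node, rows):
    """Reduced-error pruning: bottom-up, replace a subtree with a leaf
    whenever the leaf does no worse on the held-out rows that reach it."""
    if "feature" not in node:
        return node
    f, t = node["feature"], node["threshold"]
    node["left"] = prune(node["left"], [r for r in rows if r[0][f] <= t])
    node["right"] = prune(node["right"], [r for r in rows if r[0][f] > t])
    leaf = {"value": node["value"]}
    return leaf if errors(leaf, rows) <= errors(node, rows) else node

train = [
    ((120, 0.8, 0.05), 1), ((150, 0.7, 0.02), 1),
    ((90, 0.9, 0.04), 1), ((200, 0.6, 0.01), 1),
    ((30, 0.5, 0.60), 0), ((45, 0.4, 0.70), 0),
    ((20, 0.3, 0.80), 0), ((60, 0.6, 0.65), 0),
]
held_out = [((110, 0.8, 0.03), 1), ((40, 0.5, 0.70), 0)]

tree = build(train)
pruned = prune(tree, held_out)
for host in [(130, 0.7, 0.03), (25, 0.4, 0.75)]:
    label = "spam" if predict(pruned, host) >= 0.5 else "legitimate"
    print(host, label)
```

On real data the held-out set would be a separate portion of the labeled hosts, so that pruning removes splits that fit only noise in the training portion.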
