Trustworthy Website Detection Based on Social Hyperlink Network Analysis

Trustworthy website detection plays an important role in helping search engines provide users with meaningful web pages. Current solutions to this problem, however, mainly focus on detecting spam websites rather than promoting more trustworthy ones. In this paper, we propose the enhanced OpinionWalk (EOW) algorithm, which computes the trustworthiness of all websites and identifies as trustworthy those with higher trust values. The proposed EOW algorithm treats the hyperlink structure of websites as a social network and applies social trust analysis to calculate the trustworthiness of individual websites. To combine social trust analysis with trustworthy website detection, we model the trustworthiness of a website based on the quantity and quality of the websites it points to. We further design a mechanism in EOW that records which websites’ trustworthiness needs to be updated while the algorithm “walks” through the network. As a result, the execution time of EOW is reduced by 27.1 percent compared to the OpinionWalk algorithm. Using the public dataset WEBSPAM-UK2006, we validate the EOW algorithm and analyze the impact of seed selection, seed set size, maximum search depth, and starting nodes on the algorithm. Experimental results indicate that the EOW algorithm identifies 5.35 to 16.5 percent more trustworthy websites than TrustRank.
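The abstract does not give the EOW update rule, so the following Python sketch is only an illustration of the general idea, not the algorithm itself. It assumes a toy scoring rule (a damped average of the trust of the sites a site points to) and shows how a worklist can record which sites need to be updated, so that only sites affected by a change are revisited. The function name `propagate_trust`, the `damping` parameter, and the scoring rule are all hypothetical.

```python
from collections import deque


def propagate_trust(out_links, seeds, max_depth=3, damping=0.85):
    """Toy trust propagation with a worklist of sites that need updating.

    out_links: dict mapping each site to the list of sites it links to.
    seeds: set of sites assumed trustworthy (their trust is fixed at 1.0).
    max_depth: maximum number of hops from a seed that is explored.
    damping: hypothetical decay factor applied at every hop.
    """
    # Reverse index: for each site, the set of sites that point to it.
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)

    sites = set(out_links) | set(in_links)
    trust = {site: (1.0 if site in seeds else 0.0) for site in sites}

    # Worklist of (site, depth): only sites whose targets changed are queued.
    pending = deque(
        (src, 1) for seed in seeds for src in in_links.get(seed, ())
    )
    while pending:
        site, depth = pending.popleft()
        if site in seeds or depth > max_depth:
            continue
        targets = out_links.get(site, [])
        if not targets:
            continue
        # Placeholder rule (an assumption): damped average trust of the
        # sites this site points to.
        new_value = damping * sum(trust[t] for t in targets) / len(targets)
        if new_value > trust[site] + 1e-9:
            trust[site] = new_value
            # A change here may affect every site that links to this one.
            pending.extend((src, depth + 1) for src in in_links.get(site, ()))
    return trust


if __name__ == "__main__":
    links = {"a": ["b"], "b": ["c"], "c": [], "x": ["y"], "y": ["x"]}
    for site, value in sorted(propagate_trust(links, seeds={"c"}).items()):
        print(site, round(value, 4))
```

In this sketch, sites that never reach a seed through their outgoing links are never queued and keep a trust of zero, which mirrors the intuition that trust is earned by pointing to trustworthy sites.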
