Paper information - Content based web spam detection using naive bayes with different feature representation technique

Web spam detection is the process of organizing search results according to specified criteria. Most often this refers to the automatic processing of search results, but the term also applies to the automatic classification of search results into ham and spam. Our work also evaluates how performance changes when different representations are used for the document vector: term frequency (TF), binary, inverse document frequency (IDF), and TF-IDF. Various benchmark datasets are available to researchers working on web spam filtering, and there has been significant effort to generate public benchmark datasets for anti-web-spam filtering. One of the main concerns is how to protect the privacy of the users whose ham links are included in the datasets. We perform a statistical analysis of a large collection of web pages, focusing on spam detection. Dimension reduction is an important part of classification because it makes high-dimensional data easier to visualize. This work represents the training data both in a reduced 2D form and in full dimension, and maps the training and test data into a vector space. Of the several classification methods available, we use Naive Bayes classification, training on the dataset with each of the different representations and testing with different spam-to-ham ratios.

Keywords: content spam, keyword count, variety, density, hidden or invisible text
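The four document-vector representations named in the abstract, and their use with a Naive Bayes classifier, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy documents, the whitespace tokenizer, the smoothed IDF formula, the Laplace smoothing constant `alpha`, and all function and class names are assumptions made for illustration. Feeding fractional IDF or TF-IDF weights into a multinomial Naive Bayes model is a common heuristic, not something the abstract specifies.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def idf_weights(docs, vocab):
    # Assumed smoothed IDF variant: log((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}


def vectorize(doc, vocab, idf, scheme):
    # Map one document to a vector under the chosen representation.
    counts = Counter(tokenize(doc))
    if scheme == "tf":
        return [float(counts[t]) for t in vocab]
    if scheme == "binary":
        return [1.0 if counts[t] else 0.0 for t in vocab]
    if scheme == "idf":
        return [idf[t] if counts[t] else 0.0 for t in vocab]
    if scheme == "tfidf":
        return [counts[t] * idf[t] for t in vocab]
    raise ValueError(f"unknown scheme: {scheme}")


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (alpha is an assumption)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            # Per-feature weight totals for this class, smoothed by alpha.
            totals = [sum(col) + self.alpha for col in zip(*rows)]
            denom = sum(totals)
            self.log_like[c] = [math.log(t / denom) for t in totals]
        return self

    def predict(self, X):
        preds = []
        for x in X:
            scores = {
                c: self.log_prior[c]
                + sum(v * w for v, w in zip(x, self.log_like[c]))
                for c in self.classes
            }
            preds.append(max(scores, key=scores.get))
        return preds


# Hypothetical toy corpus, not taken from any benchmark dataset.
TRAIN_DOCS = [
    "cheap pills cheap pills buy now",
    "buy cheap watches now",
    "conference paper on web search",
    "search engine ranking research paper",
]
TRAIN_LABELS = ["spam", "spam", "ham", "ham"]
TEST_DOCS = ["cheap pills now", "web search research"]


def run_all():
    # Train and test one classifier per representation.
    vocab = sorted({t for d in TRAIN_DOCS for t in tokenize(d)})
    idf = idf_weights(TRAIN_DOCS, vocab)
    results = {}
    for scheme in ("tf", "binary", "idf", "tfidf"):
        X_train = [vectorize(d, vocab, idf, scheme) for d in TRAIN_DOCS]
        X_test = [vectorize(d, vocab, idf, scheme) for d in TEST_DOCS]
        results[scheme] = NaiveBayes().fit(X_train, TRAIN_LABELS).predict(X_test)
    return results


if __name__ == "__main__":
    for scheme, preds in run_all().items():
        print(scheme, preds)
```

On this toy corpus every representation labels the first test document as spam and the second as ham, so the example shows only how the vectors differ, not how accuracy differs. To mimic the evaluation the abstract describes, one could vary the proportion of spam to ham documents in `TEST_DOCS` and compare the predictions per scheme.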

Abhishek Mathur | Amit Anand Soni