Spam is the abuse of electronic messaging systems to send unsolicited bulk messages indiscriminately. Typically, spam is e-mail advertising for some product, sent to a mailing list or newsgroup. E-mail spam is the subset of electronic spam in which nearly identical messages are sent to numerous recipients by e-mail; it is also called junk e-mail or unsolicited bulk e-mail. Clicking a link in a spam e-mail may harm the computer, because spam can carry malware as scripts or other executable file attachments. Spammers collect e-mail addresses from chat rooms, websites, and newsgroups and sell them to other spammers. Spam filters are used to separate genuine messages from junk mail. However, existing filters generally perform well only on clumsy spam, that is, spam with duplicate content and suspicious keywords or spam sent from an identical notorious server, and they cannot efficiently match an incoming e-mail against a huge database. The main contribution of this work is the proposal of three techniques. First, Structure Abstraction Generation (SAG) generates an E-Mail Abstraction Plot (E-MAP) from the HTML content of an e-mail. Second, an innovative tree structure, SpTrees, stores the large number of e-mail abstractions of reported spam. Third, a complete spam detection system, Cosdes, is designed with an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme lets the system keep the most up-to-date information for near-duplicate detection. The spam detection result for each incoming e-mail is determined by a near-duplicate similarity matching process. A reputation mechanism is also used to withstand intentional attacks. Integrating these techniques, the application is developed using Visual Studio .NET 2008 as the front end and SQL Server 2005 as the back end. The code-behind is written in C# on .NET.
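The idea of abstracting an e-mail by its HTML structure and matching it against reported spam can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the C# used by the described system, and it is not the authors' SAG procedure: the names `TagSequenceParser`, `abstract_email`, and `SpamStore` are hypothetical, a plain set stands in for the SpTrees structure, and the progressive update scheme and reputation mechanism are not modeled. The only assumption it tries to capture is that two e-mails with the same ordered HTML tag sequence are treated as near-duplicates, regardless of their visible text.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects the ordered names of opening tags; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lower-cased.
        self.tags.append(tag)


def abstract_email(html_body):
    """Return a simplified structural abstraction: the tuple of opening tag names."""
    parser = TagSequenceParser()
    parser.feed(html_body)
    parser.close()
    return tuple(parser.tags)


class SpamStore:
    """Hypothetical stand-in for the spam database (a set, not SpTrees)."""

    def __init__(self):
        self._reported = set()

    def report(self, html_body):
        # Store the abstraction of an e-mail that a user reported as spam.
        self._reported.add(abstract_email(html_body))

    def is_near_duplicate(self, html_body):
        # An incoming e-mail matches if its abstraction equals a stored one.
        return abstract_email(html_body) in self._reported
```

In this sketch, a spammer who changes only the wording or the link targets of a message still produces the same abstraction, so the modified message matches the reported one; a message with a different tag layout does not.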