Spam: A Big Data Challenge

Spam takes many forms, including plain text, images, embedded HTML, and MIME attachments, and the volume of spam mail sent each day is massive. Handling this high volume, high velocity, and wide variety of spam requires a scalable filtering solution. Scalable techniques developed for machine learning and statistical analysis can be applied to spam filtering as well. In the big data analytics domain, Apache Mahout is an open source library for building scalable machine learning solutions. This paper uses the Mahout framework to compare the running time and classification accuracy of two Naive Bayes classification algorithms.

Keywords: Apache Mahout, big data, scalable algorithms, Naive Bayes algorithms