Spam Behavior Recognition Based on Session Layer Data Mining

Various approaches have been proposed to address the growing spam problem, but most of them cannot adapt dynamically as spam changes. This paper proposes a novel approach to countering spam based on spam behavior recognition, using a decision tree learned from data maintained during mail transfer sessions. A classification is built from email transfer patterns, enabling normal servers to detect malicious connections before the mail body is delivered, which saves much of the network bandwidth that spam would otherwise waste. An integrated anti-spam framework is built by combining the Behavior Classification with a Bayesian classification. Experiments show that the Behavior Classification has a high precision rate, with a recall rate that is acceptable given its bandwidth-saving feature. The integrated filter has a higher recall rate than either of its sub-modules, and its precision rate remains quite close to that of the Bayesian Classification.
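The two-stage idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the session features, the thresholds, the hand-written rule standing in for the learned decision tree, and the OR-style combination of the two stages are not taken from the paper. The point is only that the behavior stage decides from session data alone, so the mail body is fetched only when that stage does not reject the connection.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Session:
    # Hypothetical session-layer features; the abstract does not list the real ones.
    connections_per_minute: float
    recipients_per_message: int
    helo_matches_rdns: bool


def behavior_is_spam(s: Session) -> bool:
    """Hand-written stand-in for a learned decision tree (illustrative thresholds)."""
    if not s.helo_matches_rdns:
        return s.connections_per_minute > 30
    return s.recipients_per_message > 50


def integrated_filter(
    s: Session,
    fetch_body: Callable[[], str],
    content_is_spam: Callable[[str], bool],
) -> str:
    """Assumed OR-combination: reject if either stage flags the message."""
    # Stage 1: decide from session data only, before any body is transferred.
    if behavior_is_spam(s):
        return "reject-before-body"
    # Stage 2: only now pay the bandwidth cost of receiving the body.
    body = fetch_body()
    if content_is_spam(body):
        return "reject-after-body"
    return "accept"
```

Because a message is rejected when either stage flags it, a combination of this kind can only catch as much spam as, or more than, each stage alone. That is consistent with the higher recall reported for the integrated filter, although the paper's actual combination rule may differ.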
