Using Optimization-Based Classification Method for Massive Datasets

Optimization-based algorithms, such as Multi-Criteria Linear programming (MCLP), have shown their effectiveness in classification. Nevertheless, due to the limitation of computation power and memory, it is difficult to apply MCLP, or similar optimization methods, to huge datasets. As the size of today’s databases is continuously increasing, it is highly important that data mining algorithms are able to perform their functions regardless of dataset sizes. The objectives of this paper are: (1) to propose a new stratified random sampling and majority-vote ensemble approach, and (2) to compare this approach with the plain MCLP approach (which uses only part of the training set), and See5 (which is a decision-tree-based classification tool designed to analyze substantial datasets), on KDD99 and KDD2004 datasets. The results indicate that this new approach not only has the potential to handle arbitrary-size of datasets, but also outperforms the plain MCLP approach and achieves comparable classification accuracy to See5.

[1]  Yong Shi,et al.  Data Mining in Credit Card Portfolio Management: A Multiple Criteria Decision Making Approach , 2001 .

[2]  Hiroshi Konno,et al.  Credit Cards Scoring with Quadratic Utility Function , 2002 .

[3]  Ludmila I. Kuncheva,et al.  Clustering-and-selection model for classifier combination , 2000, KES'2000. Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies. Proceedings (Cat. No.00TH8516).

[4]  B. Parhami Voting algorithms , 1994 .

[5]  S. Uryasev,et al.  Credit cards scoring with quadratic utility functions , 2002 .

[6]  Louisa Lam,et al.  Classifier Combinations: Implementations and Theoretical Issues , 2000, Multiple Classifier Systems.

[7]  Eric Bauer,et al.  An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants , 1999, Machine Learning.

[8]  Yong Shi,et al.  Multiple criteria and multiple constraint levels linear programming : concepts, techniques and applications , 2001 .

[9]  Padraig Cunningham,et al.  An Approach to Aggregating Ensembles of Lazy Learners That Supports Explanation , 2002, ECCBR.

[10]  D. Opitz,et al.  Popular Ensemble Methods: An Empirical Study , 1999, J. Artif. Intell. Res..

[11]  Salvatore J. Stolfo,et al.  Cost-based modeling for fraud and intrusion detection: results from the JAM project , 2000, Proceedings DARPA Information Survivability Conference and Exposition. DISCEX'00.

[12]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[13]  Paul S. Bradley,et al.  Mathematical Programming for Data Mining: Formulations and Challenges , 1999, INFORMS J. Comput..