Discovering Contrast Sets for Efficient Classification of Big Data

With the recent emergence of Big Data, techniques that reduce its processing time have become essential. In this paper, we propose to reduce the dimensionality of objects' feature vectors by discovering their contrast sets. Contrast set mining aims to find a set of rules that best distinguish the instances of different user-defined groups. Contrast sets are thus conjunctions of attribute-value pairs that are significantly more frequent in one group than in the other. Existing techniques extract contrast sets only from categorical data or discretized numerical data. Furthermore, existing rule-based contrast set methods require user-defined thresholds to select the contrast sets. To overcome these limitations, we propose a greedy algorithm, called DisCoSet, that incrementally finds a minimum set of local features that best distinguishes a class from the other classes without resorting to discretization. We show that the proposed algorithm reduces the dimensionality by 40%-97% of the original length and yet improves the classification accuracy by 10%-24% on different datasets.
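
To make the notion of a contrast set concrete, the following minimal sketch computes how much more frequent a conjunction of attribute-value pairs is in one group than in another. It illustrates only the classical categorical definition given above and is not the DisCoSet algorithm. The function names (`support`, `support_difference`) and the toy records are assumptions introduced for illustration, and the significance test that a real contrast set miner would apply is omitted.

```python
from typing import Dict, List

# A record maps attribute names to categorical values (hypothetical toy schema).
Record = Dict[str, str]


def support(records: List[Record], pattern: Record) -> float:
    """Fraction of records matching every attribute-value pair in `pattern`."""
    if not records:
        return 0.0
    hits = sum(
        all(record.get(attr) == value for attr, value in pattern.items())
        for record in records
    )
    return hits / len(records)


def support_difference(
    group_a: List[Record], group_b: List[Record], pattern: Record
) -> float:
    """Support of `pattern` in group A minus its support in group B."""
    return support(group_a, pattern) - support(group_b, pattern)


# Toy data, invented for this example only.
group_a = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "round"},
]
group_b = [
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

# Candidate contrast set: the conjunction color=red AND shape=round.
pattern = {"color": "red", "shape": "round"}

if __name__ == "__main__":
    print(support(group_a, pattern), support(group_b, pattern))
    print(support_difference(group_a, group_b, pattern))
```

A pattern with a large support difference is a candidate contrast set. Rule-based methods typically compare this difference against a user-defined threshold, which is the dependency the abstract says DisCoSet avoids.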
