SciCSM: novel contrast set mining over scientific datasets using bitmap indices

Contrast set mining is a broadly applicable exploratory technique that identifies interesting differences across contrast groups. Existing algorithms primarily target relational datasets with categorical attributes. There is a clear need to apply this method to scientific datasets, which consist of arrays with numeric values, in order to discover interesting patterns in them. In this paper, we present a novel algorithm, SciCSM, for efficient contrast set mining over array-based datasets. We define how "interesting" contrast sets can be characterized for numeric and array data, accounting for the fact that subsets can be defined by value-based attributes, dimension-based attributes, or both. We use bitmap indices extensively to reduce computational complexity and to enable processing of larger-scale data. We demonstrate both the efficiency and the effectiveness of our algorithm on multiple real-life datasets.
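To illustrate why bitmap indices suit this task, the following is a minimal sketch, not the SciCSM algorithm itself. It assumes that numeric values are binned, that each bin is stored as a bitmap over array positions, and that a candidate contrast set is scored by the difference in its support between two groups, a common criterion in contrast set mining that the abstract does not confirm for SciCSM. The toy data, bin edges, and all function names (`build_bitmaps`, `popcount`, `support`) are hypothetical.

```python
from typing import List, Sequence


def build_bitmaps(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Build one bitmap (a Python int) per bin [edges[b], edges[b+1]).

    Bit i of a bitmap is set when values[i] falls in that bin.
    """
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        for b in range(len(edges) - 1):
            if edges[b] <= v < edges[b + 1]:
                bitmaps[b] |= 1 << i
                break
    return bitmaps


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def support(condition: int, group: int) -> float:
    """Fraction of the group's elements that satisfy the condition."""
    size = popcount(group)
    return popcount(condition & group) / size if size else 0.0


# Toy data (assumed): 8 array elements holding two variables.
temperature = [1.0, 2.0, 8.0, 9.0, 1.5, 8.5, 9.5, 2.5]
salinity = [30.0, 31.0, 35.0, 36.0, 30.5, 35.5, 31.5, 36.5]

# Dimension-based groups: positions 0..3 versus positions 4..7.
group_a = 0b00001111
group_b = 0b11110000

# Value-based attributes: a low bin and a high bin per variable.
temp_bins = build_bitmaps(temperature, [0.0, 5.0, 10.0])
sal_bins = build_bitmaps(salinity, [30.0, 33.0, 37.0])

# Candidate contrast set: high temperature AND high salinity.
candidate = temp_bins[1] & sal_bins[1]

sup_a = support(candidate, group_a)
sup_b = support(candidate, group_b)
diff = abs(sup_a - sup_b)
print(sup_a, sup_b, diff)
```

Under these assumptions, evaluating a conjunction of conditions reduces to bitwise ANDs followed by bit counts, so candidates can be scored without rescanning the raw array values. A real index would use compressed bitmaps rather than plain integers.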
