Reducing the Search Space for Big Data Mining for Interesting Patterns from Uncertain Data

Many existing data mining algorithms search for interesting patterns in transactional databases of precise data. However, there are situations in which data are uncertain. Items in each transaction of these probabilistic databases of uncertain data are usually associated with existential probabilities, which express the likelihood that these items are present in the transaction. Compared with mining from precise data, the search space for mining from uncertain data is much larger due to the presence of the existential probabilities. This problem worsens as we move into the era of Big data. Furthermore, in many real-life applications, users may be interested in only a tiny portion of this large search space for Big data mining. Without providing opportunities for users to express which patterns are interesting to them, many existing data mining algorithms return numerous patterns, of which only some are interesting. In this paper, we propose an algorithm that (i) allows users to express their interest in terms of constraints and (ii) uses the MapReduce model to mine uncertain Big data for frequent patterns that satisfy the user-specified constraints. By exploiting properties of the constraints, our algorithm greatly reduces the search space for Big data mining of uncertain data, and returns only those patterns that are interesting to the users for Big data analytics.
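To make the idea concrete, the following is a minimal single-machine sketch in Python, not the algorithm proposed in the paper. It assumes the commonly used expected-support measure (the sum, over all transactions, of the product of the existential probabilities of the items in a pattern, with items treated as independent), which the abstract itself does not spell out. The toy database `DB`, the `PRICE` attribute, the price constraint in `satisfies`, and the function names `map_phase`, `reduce_phase`, and `mine` are all hypothetical, and the map and reduce steps are plain functions rather than calls to a real MapReduce framework.

```python
from collections import defaultdict
from itertools import combinations
from math import prod

# Hypothetical uncertain database: each transaction maps an item to its
# existential probability (likelihood that the item is present).
DB = [
    {"a": 0.9, "b": 0.5, "c": 0.4},
    {"a": 0.8, "c": 0.5},
    {"b": 0.5, "c": 0.5},
]

# Hypothetical item attribute used by the illustrative constraint below.
PRICE = {"a": 10, "b": 40, "c": 20}


def satisfies(item):
    """Illustrative constraint: every item in a pattern must cost at most 25."""
    return PRICE[item] <= 25


def map_phase(transaction, k):
    """Emit (itemset, expected support contribution) pairs for one transaction.

    Items violating the constraint are dropped before any itemset is
    enumerated, which is where the search space shrinks.
    """
    valid = sorted(item for item in transaction if satisfies(item))
    for itemset in combinations(valid, k):
        # Probability that all items co-occur, assuming independence.
        yield itemset, prod(transaction[item] for item in itemset)


def reduce_phase(pairs, minsup):
    """Sum contributions per itemset and keep those meeting the threshold."""
    totals = defaultdict(float)
    for itemset, contribution in pairs:
        totals[itemset] += contribution
    return {s: v for s, v in totals.items() if v >= minsup}


def mine(db, k, minsup):
    """Run the map step over every transaction, then a single reduce step."""
    pairs = (pair for transaction in db for pair in map_phase(transaction, k))
    return reduce_phase(pairs, minsup)


if __name__ == "__main__":
    print(mine(DB, 1, 1.0))  # frequent valid 1-itemsets
    print(mine(DB, 2, 0.5))  # frequent valid 2-itemsets
```

In this sketch, item `b` never enters the enumeration because it violates the constraint, so no pattern containing `b` is generated or counted. Filtering inside the map step is only valid for constraints that can be checked item by item, as this one can; other kinds of constraints would need different handling.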
