An Efficient Distributed Programming Model for Mining Useful Patterns in Big Datasets

Abstract: Mining association rules combined with correlation and market basket analysis can discover customers' purchase rules along with frequently correlated, associated-correlated, and independent patterns synchronously, all of which are extremely useful for everyday business decisions. However, because of the main-memory bottleneck of a single computing system, existing approaches fail to handle big datasets. Moreover, most of them cannot avoid the overhead of screening null transactions; hence, performance degrades drastically. In this paper, considering these limitations, we propose a distributed programming model for mining business-oriented transactional datasets using an improved MapReduce framework on Hadoop, which not only overcomes the limits of single-processor, main-memory-based computing but is also highly scalable as the database size increases. Experimental results show that the technique proposed and developed in this paper is feasible for mining big transactional datasets in terms of time and scalability.
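The general idea of market basket counting in a map/reduce style can be illustrated with a minimal, single-process sketch. This is not the paper's algorithm or its improved MapReduce framework on Hadoop. It assumes, purely for illustration, that a null transaction is one with fewer than two distinct items (so it cannot contribute to any item pair); the function names `map_phase`, `reduce_phase`, and `mine_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction):
    """Emit (item_pair, 1) for every pair of distinct items in a transaction.

    Assumption for this sketch: a transaction with fewer than two distinct
    items is treated as a null transaction and emits nothing.
    """
    items = sorted(set(transaction))
    if len(items) < 2:
        return []
    return [(pair, 1) for pair in combinations(items, 2)]


def reduce_phase(mapped):
    """Sum the emitted counts per item pair (the reduce step)."""
    counts = defaultdict(int)
    for key, value in mapped:
        counts[key] += value
    return dict(counts)


def mine_pairs(transactions, min_support_count):
    """Run map then reduce in memory and keep pairs meeting the threshold."""
    mapped = [kv for t in transactions for kv in map_phase(t)]
    counts = reduce_phase(mapped)
    return {pair: c for pair, c in counts.items() if c >= min_support_count}


# Hypothetical toy data; ["milk"] and [] are skipped as null transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk"],
    [],
    ["bread", "butter"],
]
frequent = mine_pairs(transactions, min_support_count=2)
```

On a real cluster the map and reduce steps would run as distributed tasks over partitions of the dataset; the in-memory version here only shows the data flow and why skipping null transactions early reduces the volume of emitted key-value pairs.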
