Paper information - A New Parallel Partition Prime Multiple Algorithm for Data Mining

A New Parallel Partition Prime Multiple Algorithm for Data Mining

One of the important problems in data mining is discovering association rules from databases of transactions, where each transaction contains a set of items. Discovering the frequent itemsets requires a lot of computation power, memory and input/output resources, which can only be provided by parallel computers. In this paper, we propose a new Parallel Partition Prime Multiple Algorithm for association rule mining. The proposed algorithm addresses the shortcomings of the previously proposed Parallel Buddy Prima Algorithm. It divides the transaction database equally among the processors according to a variable assigned to each processor. The decision to assign the next transaction to a processor depends on the value of the count variable of itemsets per transaction. This reduces the time and data complexity.

1. OVERVIEW OF DATA MINING

The explosive growth of data poses the challenge of finding new techniques to extract useful patterns from such huge amounts of data. Data mining emerged as a new research area to meet this challenge and has recently attracted a lot of research attention. “Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets” [1]. These tools can include statistical models, mathematical algorithms and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining involves not only collecting and managing huge amounts of data but also analysis and prediction [2].

1.1 KNOWLEDGE DISCOVERY IN DATABASE (KDD):

Real-world data tend to be incomplete and noisy because of manual input mistakes. The integrated data sources can be stored in a database, data warehouse or other repository. The second process is to select task-related data from the integrated resources and transform them into a format that is ready to be mined.
Suppose we want to analyze which items are often purchased together in a supermarket; the database that records the purchase history may contain customer ID, items bought, transaction time, prices, number of each item and so on.

1.2 ASSOCIATION RULE

Association rule mining [3][4] is one of the most important and well-researched techniques of data mining. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Association rule mining finds interesting association or correlation relationships among a large set of data items [3]. These rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold [5].

A more formal definition is given in [6]. Let I = {i1, i2, ..., im} be a set of items and D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called the TRANSACTION ID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅ [3].

Support (s): the support s of the rule A ⇒ B is defined as the fraction of transactions in D that contain both A and B:

s(A ⇒ B) = |{T ∈ D : A ∪ B ⊆ T}| / |D|

Confidence (c): the confidence c of the rule A ⇒ B is defined as the fraction of transactions in D containing A that also contain B:

c(A ⇒ B) = |{T ∈ D : A ∪ B ⊆ T}| / |{T ∈ D : A ⊆ T}|

ISSN: 2278-5183 International Journal of Computers and Distributed Systems www.ijcdsonline.com Vol. No.2, Issue 1, December 2012 66 | Page www.cirworld.com

2. PARALLEL BUDDY PRIMA ALGORITHMS

2.1 Buddy Prima Algorithm:

In the Buddy Prima Algorithm, the support count can be calculated easily. The weakness of its prime-product representation is that the product of the prime numbers becomes very large for a transaction with many items. The algorithm also requires a lot of computation power, memory and input/output resources for large-scale association mining. To overcome these problems, a parallel version, the Parallel Buddy Prima algorithm [7], was proposed.
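To make the size concern concrete, the following minimal sketch (not taken from the paper) reports how many bits a transaction's prime product needs as the number of items grows. It assumes that each item is mapped to a distinct prime, that a transaction is stored as the product of its items' primes, and that items receive the smallest primes in order, as in Table 2. The helper names `first_primes` and `product_bits` are hypothetical.

```python
from math import prod


def first_primes(n):
    """Return the first n primes using trial division (adequate for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime if no smaller prime divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def product_bits(num_items):
    """Bits needed to store the product of the first num_items primes.

    Assumption: this models a transaction containing the num_items items
    that were assigned the smallest primes.
    """
    return prod(first_primes(num_items)).bit_length()


# Print the storage requirement for transactions of increasing length.
for n in (5, 10, 14, 16, 32):
    print(n, "items ->", product_bits(n), "bits")
```

Under these assumptions, a transaction containing the 16 items with the smallest primes already needs more than 64 bits, so a fixed-width integer type would not be enough for long transactions. Python's arbitrary-precision integers hide this, but the memory and arithmetic cost still grows with transaction length.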
This representation uses prime numbers to represent the items in a transaction. Each item is assigned a unique prime number. Each transaction is represented by the product of the prime numbers of the individual items in the transaction. Since the prime factorization of this product is unique, dividing the transaction's product by the prime product of an itemset checks whether the itemset is present in the transaction. If the remainder is zero, the itemset is present in the transaction. If the remainder is nonzero, the itemset is not present in the transaction. Having checked the presence of itemsets in transactions by this method, the Buddy Prima algorithm uses the Candidate Distribution technique. This algorithm provides scalability in terms of data dimension, size and runtime performance for large databases.

2.2 Parallel Buddy Prima Algorithm:

In this algorithm, the computation time of itemset generation is reduced. The candidate distribution technique assigns the candidate itemsets generated from different parts of the database to different processors, and each processor is assigned disjoint candidates, independent of the other processors. At the same time, the database is shared among all processors, so that each processor can generate global counts independently.

The Master node prunes the transactions by removing 1-infrequent itemsets and stores the prime multiple of each transaction in shared memory. It finds the maximal transaction length, Maxlen, and puts it in shared memory. It divides the transactions equally among the nodes for candidate generation; horizontal partitioning, vertical partitioning or checkerboard partitioning can be used to divide and distribute the transactions. The Master connects to each slave node and initiates the process of finding the frequent itemsets. Finally, the Master node reports the global frequent itemsets after gathering the local frequent itemsets.
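The Master node's preparation steps described above can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation: the function name `master_prepare`, its parameters, the toy data and the round-robin split are all assumptions, and ordinary return values stand in for shared memory.

```python
from collections import Counter
from math import prod


def master_prepare(transactions, item_prime, min_support, num_nodes):
    """Prune 1-infrequent items, compute prime multiples, find Maxlen, split TIDs.

    transactions: dict mapping TID -> list of item numbers
    item_prime:   dict mapping item number -> its allotted prime
    """
    # Count in how many transactions each item occurs.
    counts = Counter(item for items in transactions.values() for item in set(items))
    frequent = {item for item, c in counts.items() if c >= min_support}

    # Remove infrequent items from every transaction.
    pruned = {tid: [i for i in items if i in frequent]
              for tid, items in transactions.items()}

    # Prime multiple of each pruned transaction.
    multiples = {tid: prod(item_prime[i] for i in items)
                 for tid, items in pruned.items()}

    # Maximal transaction length after pruning.
    maxlen = max(len(items) for items in pruned.values())

    # Assumption: a round-robin horizontal split, so node sizes differ by at most 1.
    tids = list(pruned)
    partitions = [tids[k::num_nodes] for k in range(num_nodes)]
    return multiples, maxlen, partitions


# Hypothetical toy data, only to show the call.
toy = {"T1": [1, 2, 3], "T2": [1, 3], "T3": [1, 4], "T4": [3]}
primes = {1: 2, 2: 3, 3: 5, 4: 7}
print(master_prepare(toy, primes, min_support=2, num_nodes=2))
```

The split here ignores transaction length, which matches the equal division described in the text; a length-aware assignment would need a different partitioning rule.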
After the Master node initiates a slave node, the slave reads its allotted transactions and the maximal transaction length Maxlen. It uses the buddy approach to find the maximal frequent itemsets and the Prima representation to quickly find the support count of an itemset. Then it returns the frequent itemsets to the Master node.

For partitioning, the candidate distribution technique is adopted to handle large datasets with large itemsets. Previously, itemsets were represented as long strings, which consume a lot of memory and make data scanning difficult and time consuming. Here we propose to use prime numbers to represent items. The prime representation consumes less memory because each transaction is replaced by the product of the prime numbers of its items; as a result, it reduces the time taken to determine the support count of an itemset [7]. Similarly, it reduces the time and data complexity because the prime product is unique. The performance of the proposed algorithm is studied and compared with other existing algorithms.

3. IMPLEMENTATION OF PARALLEL BUDDY PRIMA ALGORITHM

Here, we implement the Parallel Buddy Prima Algorithm [8] on real-life supermarket transactions. In this example, Table 1 lists the transactions in the database, each identified by a transaction ID and containing the items sold at the shop. Table 2 gives the prime number allotted to every item sold in the supermarket. Some items are not sold very frequently; therefore, we set a minimum support count of 3 and remove the infrequent items, as shown in Table 3. Thereafter, Table 4 is generated by calculating the prime multiples of the transactions in the supermarket database.

Table 1: Transaction Database for Supermarket

TID    Transactions
T1     1,3,7,13
T2     4,6,10,11
T3     3,9,13
T4     4,5,7,8,14
T5     1,2,3,7,9,13
T6     4,5,7,8,9,10,14
T7     1,2,4,5,6,10,11
T8     1,3,7,9,10
T9     1,3,7,11,13
T10    4,5,6,10,11
T11    1,3,4,5,7,8,14
T12    5,8,12

Table 2: Assigned item numbers and equivalent prime number of each item

Item    Allotted Prime Number
1       2
2       3
3       5
4       7
5       11
6       13
7       17
8       19
9       23
10      29
11      31
12      37
13      41
14      43

Table 3: Transaction database after removing infrequent items

TID    Transaction
T1     1,3,7,13
T2     4,6,10,11
T3     3,9,13
T4     4,5,7,8,14
T5     1,3,7,9,13
T6     4,5,7,8,9,10,14
T7     1,4,5,6,10,11
T8     1,3,7,9,10
T9     1,3,7,11,13
T10    4,5,6,10,11
T11    1,3,4,5,7,8,14
T12    5,8

Table 4: Prima Representation of Transaction Database and their Prime Multiplications

TID    Transaction              Trans. Multiple
T1     2*5*17*41                6970
T2     7*13*29*31               81809
T3     5*23*41                  4715
T4     7*11*17*19*43            1069453
T5     2*5*17*23*41             160310
T6     7*11*17*19*23*29*43      713325151
T7     2*7*11*13*29*31          1799798
T8     2*5*17*23*29             113390
T9     2*5*17*31*41             216070
T10    7*11*13*29*31            899899
T11    2*5*7*11*17*19*43        10694530
T12    11*19                    209

Now suppose we want to know in which transactions the itemset {3, 7} occurs. We take the prime numbers allotted to items 3 and 7, namely 5 and 17 (see Table 2), multiply them to get 5*17 = 85, and perform modulo division as shown in Table 5. If the remainder of the modulo division of a transaction's multiple is 0, the itemset is present in that transaction. Table 5, which shows the modulo division used to find the support count of {3, 7}, indicates where the itemset is present. We can conclude that {3, 7} is present in transactions T1, T5, T8, T9 and T11, giving a support count of 5.
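The divisibility test for {3, 7} can be sketched in a few lines. This is an illustrative example rather than the authors' code: the item primes come from Table 2 and the transaction multiples from Table 4, while the names `ITEM_PRIME`, `MULTIPLES`, `prime_multiple` and `supporting_transactions` are hypothetical.

```python
from math import prod

# Prime allotted to each item number (Table 2).
ITEM_PRIME = dict(zip(range(1, 15),
                      [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43]))

# Prime multiple of each pruned transaction (Table 4).
MULTIPLES = {
    "T1": 6970, "T2": 81809, "T3": 4715, "T4": 1069453,
    "T5": 160310, "T6": 713325151, "T7": 1799798, "T8": 113390,
    "T9": 216070, "T10": 899899, "T11": 10694530, "T12": 209,
}


def prime_multiple(items):
    """Product of the primes allotted to the given items."""
    return prod(ITEM_PRIME[i] for i in items)


def supporting_transactions(itemset, multiples):
    """TIDs whose multiple is divisible by the itemset's prime product."""
    divisor = prime_multiple(itemset)
    return [tid for tid, m in multiples.items() if m % divisor == 0]


tids = supporting_transactions({3, 7}, MULTIPLES)
print(prime_multiple({3, 7}), tids, len(tids))
# prints: 85 ['T1', 'T5', 'T8', 'T9', 'T11'] 5
```

The test works because each prime appears at most once in a transaction's multiple, so divisibility by the itemset's product is equivalent to every item of the itemset being present.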
Table 5: Support count determination for {3, 7}

TID    Modulo Division      Remainder    Item Presence
T1     6970 mod 85          0            Yes
T2     81809 mod 85         Non-Zero     No
T3     4715 mod 85          Non-Zero     No
T4     1069453 mod 85       Non-Zero     No
T5     160310 mod 85        0            Yes
T6     713325151 mod 85     Non-Zero     No
T7     1799798 mod 85       Non-Zero     No
T8     113390 mod 85        0            Yes
T9     216070 mod 85        0            Yes
T10    899899 mod 85        Non-Zero     No
T11    10694530 mod 85      0            Yes
T12    209 mod 85           Non-Zero     No

The Parallel Buddy Prima algorithm does not follow any intelligent load balancing algorithm. Therefore, it may

P. Bhattacharya | Jitendra Agrawal | M. Tiwari

[1] R. Agrawal. Fast Algorithms for Mining Association Rules, 1994, VLDB 1994.

[2] Shamkant B. Navathe, et al. An Efficient Algorithm for Mining Association Rules in Large Databases, 1995, VLDB.

[3] Vipin Kumar, et al. Scalable parallel data mining for association rules, 1997, SIGMOD '97.

[4] David B. Skillicorn, et al. Strategies for parallel data mining, 1999, IEEE Concurr.

[5] Markus Hegland, et al. Algorithms for Association Rules, 2002, Machine Learning Summer School.

[6] Petra Perner, et al. Data Mining - Concepts and Techniques, 2002, Künstliche Intell.

[7] S. Sumathi, et al. Parallel Buddy Prima - A Hybrid Parallel Frequent itemset mining algorithm for very large databases, 2004.

[8] Dimitris Kanellopoulos, et al. Association Rules Mining: A Recent Overview, 2006.

[9] Frans Coenen, et al. A Novel Rule Weighting Approach in Classification Association Rule Mining, 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).