CLAP: Collaborative pattern mining for distributed information systems

Data mining from distributed information systems usually has three goals: (1) identifying locally significant patterns in individual databases; (2) discovering significant patterns that emerge after the distributed databases are unified into a single view; and (3) finding patterns that follow special relationships across different data collections. Existing research has significantly advanced the techniques for mining local and global patterns (the first two goals), but few attempts have been made to discover patterns across distributed databases (the third goal). Moreover, no framework currently supports the mining of all three types of patterns. This paper proposes solutions for discovering patterns from distributed databases. More specifically, we treat pattern mining as a query process whose purpose is to discover patterns from distributed databases such that the relationships among the patterns satisfy user-specified query constraints. We argue that existing self-contained mining frameworks are neither efficient nor feasible for this objective, mainly because their pattern pruning is oriented toward a single database. To solve the problem, we advocate a cross-database pruning concept and propose a collaborative pattern (CLAP) mining framework with cross-database pruning mechanisms for distributed pattern mining. In CLAP, distributed databases exchange pattern information between sites, so that each site can use information from other sites to prune candidates across databases. Experimental results show that CLAP fills a niche: it not only outperforms its peers with significant runtime gains, but also finds patterns that the other methods cannot discover.
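
The cross-database pruning idea can be illustrated with a minimal Python sketch. It is not the CLAP algorithm itself, whose details the abstract does not give. It assumes a deliberately simple query constraint (an itemset must reach a minimum support at every site), and the names `support`, `mine_with_cross_pruning`, `sites`, and `thresholds` are hypothetical. The point it shows is that a candidate rejected by one site is never counted at the remaining sites and never extended at any site.

```python
from itertools import combinations


def support(db, itemset):
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)


def mine_with_cross_pruning(sites, thresholds):
    """Level-wise search for itemsets that meet the threshold at every site.

    sites:      dict mapping a site name to a list of transactions (sets)
    thresholds: dict mapping a site name to its minimum support
    Assumption: the query constraint is a conjunction of minimum supports,
    which is anti-monotone, so level-wise pruning is safe.
    """
    items = sorted(set().union(*(t for db in sites.values() for t in db)))
    level = [frozenset([i]) for i in items]
    results = []
    while level:
        # all() short-circuits: once one site rejects a candidate,
        # the remaining sites skip counting it (the cross-database prune).
        survivors = [
            cand for cand in level
            if all(support(db, cand) >= thresholds[name]
                   for name, db in sites.items())
        ]
        results.extend(survivors)
        k = len(level[0]) + 1
        survivor_set = set(survivors)
        # Join surviving itemsets into candidates one item larger.
        joined = {a | b for a in survivors for b in survivors
                  if len(a | b) == k}
        # Keep a candidate only if all of its (k-1)-subsets survived.
        level = sorted(
            (c for c in joined
             if all(frozenset(s) in survivor_set
                    for s in combinations(c, k - 1))),
            key=sorted,
        )
    return results


# Toy data: two sites with three transactions each.
sites = {
    "A": [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}],
    "B": [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}],
}
thresholds = {"A": 0.6, "B": 0.6}
found = mine_with_cross_pruning(sites, thresholds)
```

In this toy run, `{a, c}` meets the threshold at site A but not at site B, so a miner that looks only at site A would keep extending it, while the joint check discards it immediately. A real system would exchange compact summaries between sites rather than share raw transactions in one process; that part is omitted here.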
