Distributed Decision-Tree Induction in Peer-to-Peer Systems

This paper offers a scalable and robust distributed algorithm for decision-tree induction in large peer-to-peer (P2P) environments. Computing a decision tree in such large distributed systems with standard centralized algorithms can be very communication-expensive and, because of the synchronization requirements, impractical. The problem becomes even more challenging in the distributed stream-monitoring scenario, where the decision tree must be updated in response to changes in the data distribution. This paper presents an alternative solution that works in a completely asynchronous manner in distributed environments and offers low communication overhead, a necessity for scalability. It also seamlessly handles changes in data and peer failures. The paper presents extensive experimental results to corroborate the theoretical claims. Copyright © 2008 Wiley Periodicals, Inc., A Wiley Company. Statistical Analy Data Mining 1: 000-000, 2008
