Why Naive Ensembles Do Not Work in Cloud Computing

One of the greatest challenges in data mining is dealing with very large datasets, and cloud computing has demonstrated clear advantages in processing them. There are several ways to parallelize an existing data mining algorithm so that it can run on a high-performance data cloud. One concern is how to achieve better performance by using the data more intelligently. In this paper, we describe two approaches to parallelizing the existing random decision tree algorithm, both of which we have built on the Sector/Sphere cloud computing environment. We compare the cost and accuracy of the two implementations and analyze the results of this experimental study.
