Parallelizing an Information Theoretic Co-clustering Algorithm Using a Cloud Middleware

Emerging cloud environments are well suited to the storage and analysis of large datasets because they provide on-demand access to resources. However, developing high-performance implementations of data analysis tasks remains challenging. In our prior work, we developed a middleware called FREERIDE (FRamework for Rapid Implementation of Data mining Engines). FREERIDE is based on the observation that the processing structure of a large number of data mining algorithms involves generalized reductions. It offers a high-level interface and implements both distributed-memory and shared-memory parallelization. In this paper, we consider a challenging new data mining algorithm, information theoretic co-clustering, and parallelize it using the FREERIDE middleware. We show how the main processing loops of the algorithm, row clustering and column clustering, can be fit into a generalized reduction structure. We achieve good parallel efficiency, with a speedup of nearly 21 on 32 cores (roughly 65% efficiency).
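To make the idea of a generalized reduction more concrete, the following Python sketch shows how a row-clustering pass might be organized: each row is processed independently, its contribution is accumulated into a small reduction object, and the per-chunk reduction objects are merged by element-wise addition. This is only an illustration under simplifying assumptions, not FREERIDE's actual interface. The names `kl_row`, `local_reduction`, and `combine`, the fixed cluster prototypes, and the toy co-occurrence counts are all hypothetical.

```python
import math
from functools import reduce


def kl_row(row, proto):
    """KL divergence from a row (normalized to sum to 1) to a cluster prototype."""
    total = sum(row)
    d = 0.0
    for p, q in zip(row, proto):
        if p > 0:
            p_norm = p / total
            # Clamp q to avoid log of zero; an assumption made for this sketch.
            d += p_norm * math.log(p_norm / max(q, 1e-12))
    return d


def local_reduction(rows, prototypes, col_cluster, n_col_clusters):
    """Process one chunk of rows and return (reduction object, row assignments).

    The reduction object is a row-cluster x column-cluster table of summed counts.
    """
    k = len(prototypes)
    red = [[0.0] * n_col_clusters for _ in range(k)]
    assign = []
    for row in rows:
        # Assign the row to the closest prototype (simplified assignment rule).
        best = min(range(k), key=lambda c: kl_row(row, prototypes[c]))
        assign.append(best)
        # Accumulate the row's counts into the reduction object.
        for j, v in enumerate(row):
            red[best][col_cluster[j]] += v
    return red, assign


def combine(a, b):
    """Global combination step: element-wise sum of two reduction objects."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy co-occurrence counts (hypothetical data): 4 rows, 4 columns.
rows = [[4, 4, 0, 0], [3, 5, 0, 0], [0, 0, 5, 3], [0, 1, 4, 4]]
prototypes = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
col_cluster = [0, 0, 1, 1]  # current column-cluster label of each column

# Processing all rows at once ...
whole, whole_assign = local_reduction(rows, prototypes, col_cluster, 2)

# ... gives the same table as processing two chunks separately and merging.
partials = [local_reduction(chunk, prototypes, col_cluster, 2)[0]
            for chunk in (rows[:2], rows[2:])]
merged = reduce(combine, partials)
```

Because the accumulation is a sum, the order in which rows are processed does not affect the result, which is what allows the chunks to be handled by different threads or nodes. A column-clustering pass could follow the same pattern with the roles of rows and columns swapped.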
