MapReduce Algorithms for Big Data Analysis

There is a growing trend of applications that should handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. Google's MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, we will introduce the MapReduce framework based on Hadoop, discuss how to design efficient MapReduce algorithms and present the state-of-the-art in MapReduce algorithms for data mining, machine learning and similarity joins. The intended audience of this tutorial is professionals who plan to design and develop MapReduce algorithms and researchers who should be aware of the state-of-the-art in MapReduce algorithms available today for big data analysis.

[1]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[2]  Abhinandan Das,et al.  Google news personalization: scalable online collaborative filtering , 2007, WWW '07.

[3]  Jimmy J. Lin,et al.  Pairwise Document Similarity in Large Collections with MapReduce , 2008, ACL.

[4]  Younghoon Kim,et al.  Parallel Top-K Similarity Join Algorithms Using MapReduce , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[5]  Christos Faloutsos,et al.  PEGASUS: mining peta-scale graphs , 2011, Knowledge and Information Systems.

[6]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[7]  Shivnath Babu,et al.  Towards automatic optimization of MapReduce programs , 2010, SoCC '10.

[8]  Chao Liu,et al.  Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce , 2010, WWW '10.

[9]  Mirek Riedewald,et al.  Processing theta-joins using MapReduce , 2011, SIGMOD '11.

[10]  Christopher Ré,et al.  Automatic Optimization for MapReduce Programs , 2011, Proc. VLDB Endow..

[11]  Di Ma,et al.  MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce , 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[12]  Roberto J. Bayardo,et al.  PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce , 2009, Proc. VLDB Endow..

[13]  Sergei Vassilvitskii,et al.  Counting triangles and the curse of the last reducer , 2011, WWW.

[14]  Enhong Chen,et al.  Towards context-aware search by learning a very large variable length hidden markov model from search logs , 2009, WWW '09.

[15]  Joydeep Ghosh,et al.  Parallel Simultaneous Co-clustering and Learning with Map-Reduce , 2010, 2010 IEEE International Conference on Granular Computing.

[16]  Ranieri Baraglia,et al.  Document Similarity Self-Join with MapReduce , 2010, 2010 IEEE International Conference on Data Mining.

[17]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[18]  Feifei Li,et al.  Building Wavelet Histograms on Large Data in MapReduce , 2011, Proc. VLDB Endow..

[19]  Jimeng Sun,et al.  DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[20]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[21]  Jordan L. Boyd-Graber,et al.  Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce , 2012, WWW.

[22]  Beng Chin Ooi,et al.  Query optimization for massively parallel data processing , 2011, SoCC.

[23]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MaPreduce , 2010, SIGMOD Conference.

[24]  Feng Li,et al.  An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce , 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.

[25]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[26]  Christos Faloutsos,et al.  V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors , 2012, Proc. VLDB Endow..

[27]  Christos Faloutsos,et al.  Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation , 2011, PAKDD.

[28]  Younghoon Kim,et al.  TWITOBI: A Recommendation System for Twitter Using Probabilistic Modeling , 2011, 2011 IEEE 11th International Conference on Data Mining.

[29]  Edward Y. Chang,et al.  Pfp: parallel fp-growth for query recommendation , 2008, RecSys '08.

[30]  Edward Y. Chang,et al.  PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications , 2009, AAIM.

[31]  Min Wang,et al.  Efficient Multi-way Theta-Join Processing Using MapReduce , 2012, Proc. VLDB Endow..