AccurateML: Information-aggregation-based approximate processing for fast and accurate machine learning on MapReduce

The growing demand for processing massive datasets has driven a strong trend toward running machine learning applications on MapReduce. When processing large input data, it is often more valuable to produce fast, sufficiently accurate approximate results than slow exact results. Existing techniques produce approximate results by processing only part of the input data, and they therefore incur large accuracy losses under short job execution times, because all of the skipped input data potentially contributes to result accuracy. We address this limitation by proposing AccurateML, which aggregates the information of the input data in each map task to create small aggregated data points. These aggregated points enable all map tasks to produce initial outputs quickly, which saves computation time, and they decrease the size of the outputs, which reduces communication time. Our approach further identifies the parts of the input data most related to result accuracy and uses these parts first to improve the produced outputs, thereby minimizing accuracy losses. We evaluated AccurateML using real machine learning applications and datasets. The results show that: (i) it reduces execution times by a factor of 30 with small accuracy losses compared to exact results; (ii) for the same execution time, it reduces accuracy losses by a factor of 2.71 compared to existing approximate processing techniques.
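The two-stage idea described above (aggregate first, then refine with the most relevant original points) can be illustrated with a minimal sketch for a single map task performing a nearest-neighbor lookup. This is not the AccurateML implementation: the grouping rule (fixed-size runs of consecutive points), the use of the group mean as the aggregated point, and the names `aggregate`, `approx_nearest`, `group_size`, and `refine_groups` are all assumptions made for illustration.

```python
import math


def aggregate(points, group_size):
    """Split points into consecutive groups and summarize each by its mean.

    Assumption: grouping by position and averaging is only a stand-in for
    whatever aggregation a real system would use.
    """
    groups = [points[i:i + group_size] for i in range(0, len(points), group_size)]
    centroids = [tuple(sum(coord) / len(g) for coord in zip(*g)) for g in groups]
    return groups, centroids


def approx_nearest(query, points, group_size=4, refine_groups=1):
    """Return (initial, refined) nearest-neighbor answers for one map task.

    initial: the closest aggregated point (cheap, computed from summaries only).
    refined: the closest original point among the `refine_groups` groups whose
             aggregated points are nearest to the query.
    """
    groups, centroids = aggregate(points, group_size)
    # Rank groups by how close their aggregated point is to the query.
    order = sorted(range(len(groups)), key=lambda i: math.dist(query, centroids[i]))
    initial = centroids[order[0]]
    # Refine using only the original points of the most relevant groups.
    candidates = [p for i in order[:refine_groups] for p in groups[i]]
    refined = min(candidates, key=lambda p: math.dist(query, p))
    return initial, refined


points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
initial, refined = approx_nearest((10.2, 10.1), points, group_size=4, refine_groups=1)
print(initial, refined)
```

In this sketch, raising `refine_groups` trades extra computation for accuracy; setting it to the number of groups examines every original point and so recovers the exact answer.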
