Dataset Popularity Prediction for Caching of CMS Big Data

The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) runs its data collection, simulation, and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. The historical usage data recorded by this large infrastructure is a rich source of information for system tuning and capacity planning. In this paper we investigate how to apply machine learning to this large volume of data in order to discover patterns and correlations that can improve the overall efficiency of the distributed infrastructure in terms of CPU utilization and task completion time. In particular, we propose a scalable pipeline of components built on top of the Spark engine for large-scale data processing. The pipeline collects the dataset access logs from the different sites, organizes them into weekly snapshots, and trains on these snapshots predictive models able to forecast which datasets will become popular over time. The high accuracy achieved indicates that the learned model can correctly separate popular datasets from unpopular ones. Dataset popularity predictions are then exploited within a novel data caching policy, called PPC (Popularity Prediction Caching). We evaluate the performance of PPC against widely used caching policy baselines such as LRU (Least Recently Used). Experiments conducted on large traces of real dataset accesses show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.
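
The idea of letting popularity predictions guide cache eviction can be illustrated with a small simulation. The sketch below is not the PPC policy itself, whose details are not given in this abstract; it is a hypothetical LRU variant that, when the cache is full, prefers to evict the least recently used dataset that is not predicted to be popular. The function names, the `predicted_popular` set, and the toy trace are assumptions made purely for illustration.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Count cache misses for plain LRU over a sequence of dataset names."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for dataset in trace:
        if dataset in cache:
            cache.move_to_end(dataset)  # mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used entry
        cache[dataset] = True
    return misses


def simulate_popularity_aware(trace, capacity, predicted_popular):
    """Count misses for a hypothetical popularity-aware LRU variant.

    Assumption: `predicted_popular` is a set of dataset names that some
    classifier has flagged as popular. On eviction, the least recently
    used dataset outside that set is removed; if every cached dataset is
    predicted popular, the policy falls back to plain LRU.
    """
    cache = OrderedDict()
    misses = 0
    for dataset in trace:
        if dataset in cache:
            cache.move_to_end(dataset)
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = next((d for d in cache if d not in predicted_popular), None)
            if victim is None:
                cache.popitem(last=False)
            else:
                del cache[victim]
        cache[dataset] = True
    return misses


if __name__ == "__main__":
    # Toy access trace (made up): dataset "A" is requested repeatedly.
    trace = ["A", "B", "C", "A", "D", "A", "B", "C", "A"]
    print("LRU misses:", simulate_lru(trace, capacity=2))
    print("Popularity-aware misses:",
          simulate_popularity_aware(trace, capacity=2, predicted_popular={"A"}))
```

On this toy trace the popularity-aware variant keeps the frequently requested dataset in the cache and incurs fewer misses than plain LRU. With an empty prediction set it behaves exactly like LRU, so any benefit depends entirely on the quality of the predictions.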
