Rapid Parallel Detection of Distance-based Outliers in Time Series using MapReduce

Time series analysis is crucial in many knowledge domains, ranging from micro- and macroeconomics, industry, tourism, and health to hydrology, meteorology, agriculture, and demography. Interest in the efficient and meaningful processing of time series data has increased over the last decade with the spread of sensor networks and Cyber-Physical Systems, which produce huge amounts of measured data. Outlier detection is a key issue in the Quality Assurance of time series data; its goal is to detect objects whose behavior differs markedly from the expected one. Once identified, these objects are either removed or corrected. In this paper we propose a highly scalable parallel data processing algorithm for outlier ranking based on the distance between data objects. In contrast to existing sequential implementations, the proposed algorithm relies on the parallel processing provided by the MapReduce paradigm. Using real monitored solar data for experimental validation, we show a dramatic improvement in running time for large archives of time series (on the order of millions of records).
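The general idea of distance-based outlier ranking in a map/reduce style can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: it assumes the outlier score of a point is the distance to its k-th nearest neighbor, uses scalar measurements, and simulates the map and reduce phases in a single process rather than on a Hadoop cluster. All function names and parameters (`map_partition`, `reduce_point`, `rank_outliers`, `k`, `n_partitions`) are hypothetical.

```python
import heapq
from collections import defaultdict


def map_partition(partition, all_points, k):
    """Map step (assumed design): for every point in the data set, emit its
    k smallest distances to the points held by this partition."""
    for i, x in all_points:
        dists = [abs(x - y) for j, y in partition if j != i]
        yield i, heapq.nsmallest(k, dists)


def reduce_point(candidate_lists, k):
    """Reduce step: merge the per-partition candidates of one point and
    return the distance to its k-th nearest neighbor as the outlier score."""
    merged = heapq.nsmallest(k, (d for lst in candidate_lists for d in lst))
    return merged[-1]


def rank_outliers(values, k=2, n_partitions=3):
    """Return (index, score) pairs sorted from most to least outlying."""
    if len(values) <= k:
        raise ValueError("need more than k values to compute k-NN distances")
    indexed = list(enumerate(values))
    # Round-robin split standing in for the input splits of a real job.
    partitions = [indexed[p::n_partitions] for p in range(n_partitions)]
    grouped = defaultdict(list)  # shuffle phase: group candidates by point index
    for part in partitions:
        for key, candidates in map_partition(part, indexed, k):
            grouped[key].append(candidates)
    scores = {i: reduce_point(c, k) for i, c in grouped.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative (made-up) measurements; the fifth value is far from the rest.
ranking = rank_outliers([10.0, 10.2, 9.9, 10.1, 25.0, 10.3], k=2)
print(ranking[0])
```

Because each map call only needs one partition plus the points being scored, the distance computations can be spread across workers, and the reduce step only merges short candidate lists per point. A real deployment would also have to address how the full data set is made available to each mapper, which this sketch ignores.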
