Big data pre-processing methods with vehicle driving data using MapReduce techniques

A huge amount of sensing data is generated by the large number of pervasive IoT devices. To extract meaningful information from such big data, pre-processing is essential: many outlier data points must be removed, because data quality deteriorates as time passes. Although pre-processing is essential in the big data field, few studies have examined it through case studies. In this paper, big data pre-processing methods are investigated and proposed. To evaluate the pre-processing methods for accurate analysis, we used a collection of digital tachograph (DTG) data, consisting of DTG sensing data from 6198 driving vehicles collected over a year. We studied five kinds of pre-processing methods: filtering ranges, excluding meaningless values, comparing filters from variables, applying statistical techniques, and finding driving patterns. In addition, we developed a MapReduce program on the Hadoop ecosystem and applied it to the big data set to perform the pre-processing analysis. Through the pre-processing steps, we confirmed that the proportion of DTG sensing data points containing any error was up to 27.09%. Compared with the traditional brute-force detection approach, ours achieved a 71.1% additional detection effect. We also confirmed that outlier data points that are difficult to detect through simple range-error pre-processing could be detected well.
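As an illustration of how range filtering, exclusion of meaningless values, and cross-variable comparison might be expressed in a MapReduce style, the following is a minimal in-process Python sketch. It is not the program developed in this work: the field names (`vehicle_id`, `speed_kmh`, `rpm`), the valid ranges, and the speed/RPM consistency rule are all hypothetical assumptions chosen for illustration, and the map and reduce steps run locally rather than on Hadoop.

```python
from collections import defaultdict

# Hypothetical valid ranges per field; the actual thresholds are not stated here.
VALID_RANGES = {"speed_kmh": (0, 200), "rpm": (0, 8000)}


def map_record(record):
    """Map step: emit ((vehicle_id, error_label), ...) keys for one DTG record."""
    vid = record["vehicle_id"]
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is None:
            # Meaningless (missing) value
            yield (vid, "missing:" + field)
        elif not low <= value <= high:
            # Simple range error
            yield (vid, "range:" + field)
    # Hypothetical cross-variable check: vehicle moving while engine reports 0 rpm
    speed, rpm = record.get("speed_kmh"), record.get("rpm")
    if speed is not None and rpm is not None and speed > 0 and rpm == 0:
        yield (vid, "inconsistent:speed_rpm")


def reduce_counts(keys):
    """Reduce step: count how often each (vehicle_id, error_label) key occurs."""
    counts = defaultdict(int)
    for key in keys:
        counts[key] += 1
    return dict(counts)


def run(records):
    """Apply the map step to every record, then reduce the emitted keys."""
    emitted = [key for record in records for key in map_record(record)]
    return reduce_counts(emitted)


if __name__ == "__main__":
    sample = [
        {"vehicle_id": "v1", "speed_kmh": 250, "rpm": 3000},
        {"vehicle_id": "v1", "speed_kmh": 60, "rpm": 0},
        {"vehicle_id": "v2", "speed_kmh": 50, "rpm": 2000},
        {"vehicle_id": "v2", "rpm": 1000},
    ]
    for key, count in sorted(run(sample).items()):
        print(key, count)
```

In this sketch the second sample record passes both range checks and is flagged only by the cross-variable rule, which mirrors the observation that some outliers escape simple range-error pre-processing.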
