Large Imbalance Data Classification Based on MapReduce for Traffic Accident Prediction

Everyday life in modern society is closely tied to traffic, and predicting traffic accidents is one of the most pressing traffic problems. Accidents on the road can be predicted by classification analysis, a data mining procedure that requires enough data to build a learning model. Building such a prediction system raises several problems. First, because traffic data is very large, collecting and analyzing it requires substantial hardware resources. Second, data related to traffic accidents is scarce compared with data that is not related to them, so the two classes are imbalanced. The purpose of this paper is to build a prediction model that resolves these problems. The paper proposes using the Hadoop framework to process and analyze big traffic data efficiently, and a sampling method to resolve the data imbalance. On this basis, the prediction system first preprocesses the big traffic data and analyzes it to create data for the learning system. The imbalance in the created data is then corrected by a sampling method. To improve prediction accuracy, the corrected data is divided into several groups, to which classification analysis is applied. These analysis steps are carried out on the Hadoop framework.
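The abstract does not say which sampling method corrects the imbalance. As an illustration only, the sketch below assumes random undersampling, which discards majority-class records until every class is as small as the rarest one. The function name `undersample`, the record layout `(speed_kmh, accident_flag)`, and the toy data are all hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def undersample(records, label_of, seed=0):
    """Randomly drop records so every class has as many rows as the smallest class.

    Assumption: random undersampling stands in for the paper's unspecified
    sampling method.
    """
    rng = random.Random(seed)

    # Group the records by class label.
    by_label = {}
    for rec in records:
        by_label.setdefault(label_of(rec), []).append(rec)

    # The rarest class sets the size every class is reduced to.
    target = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 95 non-accident rows (flag 0) and 5 accident rows (flag 1).
records = [(60 + i % 40, 0) for i in range(95)] + [(30 + i, 1) for i in range(5)]

balanced = undersample(records, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts[0], counts[1])
```

Undersampling keeps the balanced set small, which suits very large inputs, but it throws away majority-class information. Oversampling the accident class is the usual alternative when accident records are too few to train on.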
