A Classification Prediction Analysis of Flight Cancellation Based on Spark

Abstract Flight delays and cancellations are becoming increasingly serious. They not only waste transportation resources but also disrupt passengers’ travel plans, which increases passenger discontent and complaint rates[2]. Passengers’ dissatisfaction with and distrust of airlines seriously damage the airlines’ corporate reputation and, in turn, erode passenger loyalty. Therefore, using data on 5 million flights in the United States in 2016, we predicted flight cancellations with four classification algorithms: logistic regression, support vector machine (SVM), naive Bayes and decision tree. We compared the training time of these algorithms while varying the number of nodes in the Spark cluster[9], and we also compared their classification accuracy, AUC (Area Under Curve) and PR (Precision-Recall). Experimental results show that: (1) with 1 node, the average training time of the four algorithms was 460 seconds, and with 3, 5, 7, 9 and 11 nodes it was 207.25, 89.75, 44, 42 and 43.5 seconds respectively; (2) from 7 nodes onward, the decline in average training time slowed as the number of nodes increased; (3) with 1 and 3 nodes, naive Bayes had the shortest training time, 386 seconds and 185 seconds respectively, whereas with 5, 7, 9 and 11 nodes, SVM had the shortest training time, 72, 21, 19 and 18 seconds respectively; (4) the classification accuracy of SVM and decision tree was almost 90%, whereas that of naive Bayes was only about 50.8% and that of logistic regression only about 62.4%; (5) the decision tree algorithm had the highest AUC and PR, at 0.558 and 0.439 respectively.
Therefore, the decision tree algorithm was the most suitable for predicting whether a flight would be cancelled, and its predictions were correct in almost 90% of cases.
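
The scaling figures in result (1) can be turned into speedups and step-to-step changes with a few lines of arithmetic. The following Python sketch is illustrative only: it uses the average training times quoted in the abstract, and the names `avg_training_time`, `speedups` and `step_changes` are hypothetical helpers introduced here, not part of the study's code.

```python
# Average training times in seconds, keyed by number of Spark nodes,
# copied from result (1) of the abstract.
avg_training_time = {1: 460.0, 3: 207.25, 5: 89.75, 7: 44.0, 9: 42.0, 11: 43.5}


def speedups(times):
    """Speedup of each configuration relative to the smallest node count."""
    baseline = times[min(times)]
    return {n: baseline / t for n, t in sorted(times.items())}


def step_changes(times):
    """Percentage change in training time between consecutive node counts."""
    nodes = sorted(times)
    return {
        (a, b): 100.0 * (times[b] - times[a]) / times[a]
        for a, b in zip(nodes, nodes[1:])
    }


speedup = speedups(avg_training_time)
changes = step_changes(avg_training_time)

for n, s in speedup.items():
    print(f"{n:>2} nodes: speedup {s:.2f}x")

for (a, b), pct in changes.items():
    print(f"{a:>2} -> {b:>2} nodes: {pct:+.1f}% training time")
```

With these figures, the step from 7 to 9 nodes shortens the average training time by less than 5%, and the step from 9 to 11 nodes lengthens it slightly, which is consistent with result (2).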