Predicting Hourly Boarding Demand of Bus Passengers Using Imbalanced Records From Smart-Cards: A Deep Learning Approach

Tap-on smart-card data are a valuable source for learning passengers' boarding behaviour and predicting future travel demand. However, when smart-card records (or instances) are examined by time of day and by boarding stop, positive instances (i.e. boarding at a specific bus stop at a specific time) are rare compared to negative instances (not boarding at that bus stop at that time). Imbalanced data has been shown to significantly reduce the accuracy of machine-learning models deployed to predict hourly boarding numbers at a particular location. This paper addresses the data imbalance issue in smart-card data before using the data to predict bus boarding demand. We propose a deep generative adversarial network (Deep-GAN) that generates dummy travelling instances, which are added to the original records to form a synthetic training dataset with a more balanced mix of travelling and non-travelling instances. The synthetic dataset is then used to train a deep neural network (DNN) that predicts travelling and non-travelling instances at a particular stop in a given time window. The results show that addressing the data imbalance issue can significantly improve the predictive model's performance and yield a better fit to the actual ridership profile. A comparison of the Deep-GAN with traditional resampling methods shows that the proposed method produces a synthetic training dataset with higher similarity and diversity and, thus, stronger predictive power. The paper highlights the significance of, and provides practical guidance on, improving data quality and model performance for travel behaviour prediction and individual travel behaviour analysis.
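The imbalance problem described above can be illustrated with a minimal Python sketch. It does not implement the Deep-GAN; it uses plain random oversampling, one of the simplest traditional resampling methods, only to show what a balanced training set looks like. The record layout (hour, label), the 3% boarding rate, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def make_records(n_slots=1000, boarding_rate=0.03, seed=0):
    """Build hypothetical (hour, label) records; label 1 = boarding, 0 = no boarding."""
    rng = random.Random(seed)
    return [
        (rng.randrange(24), 1 if rng.random() < boarding_rate else 0)
        for _ in range(n_slots)
    ]


def random_oversample(records, seed=0):
    """Duplicate minority-class records at random until both classes are equal in size.

    This is a simple resampling baseline, not the generative approach of the paper.
    """
    rng = random.Random(seed)
    pos = [r for r in records if r[1] == 1]
    neg = [r for r in records if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate: return the data unchanged.
        return list(records)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def majority_baseline_accuracy(records):
    """Accuracy of a trivial model that always predicts the most common label."""
    counts = Counter(label for _, label in records)
    return max(counts.values()) / len(records)


if __name__ == "__main__":
    records = make_records()
    balanced = random_oversample(records)
    print("raw class counts:     ", Counter(label for _, label in records))
    print("balanced class counts:", Counter(label for _, label in balanced))
    print("trivial accuracy, raw:     ", majority_baseline_accuracy(records))
    print("trivial accuracy, balanced:", majority_baseline_accuracy(balanced))
```

On the raw records, a model that never predicts a boarding already scores a high accuracy, which is why accuracy alone is misleading here. Random oversampling only repeats existing positive instances, whereas a generative approach such as the one proposed aims to produce new, more diverse ones.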
