An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management

Smart urban transportation management can be considered a multifaceted big data challenge. It relies strongly on information collected in multiple, widespread, and heterogeneous data sources, as well as on the ability to extract actionable insights from them. Besides data, full-stack Information and Communications Technology (ICT) solutions, spanning platform, services, and applications, need to be specifically adopted to address smart city challenges. Smart urban transportation management is one of the key use cases addressed in the context of the EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications) project. This paper focuses on the City Administration Dashboard, a public transport analytics application that has been developed on top of the EUBra-BIGSEA platform and used by the municipal stakeholders of Curitiba, Brazil, to tackle urban traffic data analysis and planning challenges. The proposed solution combines a scalable big and fast data analytics platform, a flexible and dynamic cloud infrastructure, data quality and entity matching algorithms, and security and privacy techniques. By exploiting an interoperable programming framework based on a Python Application Programming Interface (API), it allows easy, rapid, and transparent development of smart city applications.
