BIGSEA: A Big Data analytics platform for public transportation information

Abstract: Analysing public transportation data in large cities is a challenging problem. Data ingestion, storage, quality enhancement, modelling and analysis demand intensive computing and substantial resources. In EUBra-BIGSEA (Europe–Brazil Collaboration of Big Data Scientific Research Through Cloud-Centric Applications) we address these problems in a comprehensive and integrated way. EUBra-BIGSEA provides a platform for building data analytics workflows on top of elastic cloud services without requiring skills in either programming or cloud services. The approach combines cloud orchestration, Quality of Service and automatic parallelisation on a platform that includes a toolbox for implementing privacy guarantees and data quality enhancement, as well as advanced services for sentiment analysis, traffic jam estimation and trip recommendation based on estimated crowdedness. All developments are available under Open Source licenses (http://github.org/eubr-bigsea, https://hub.docker.com/u/eubrabigsea/).
