Automotive big data: Applications, workloads and infrastructures

Data is increasingly affecting the automotive industry, from vehicle development, to manufacturing and service processes, to online services centered around the connected vehicle. Connected, mobile and Internet of Things devices and machines generate immense amounts of sensor data. The ability to process and analyze this data, and to extract insights and knowledge that enable intelligent services, new ways of understanding business problems, and improved processes and decisions, is a critical capability. Hadoop is a scalable platform for compute and storage that has emerged as the de facto standard for Big Data processing at Internet companies and in the scientific community. However, there is a lack of understanding of how, and for which use cases, these new Hadoop capabilities can be used efficiently to augment automotive applications and systems. This paper surveys use cases and applications for deploying Hadoop in the automotive industry. Over the years, a rich ecosystem has emerged around Hadoop, comprising tools for parallel, in-memory and stream processing (most notably MapReduce and Spark), SQL and NoSQL engines (Hive, HBase), and machine learning (Mahout, MLlib). It is critical to develop an understanding of automotive applications and of their characteristics and requirements for data discovery, integration, exploration and analytics. We then map these requirements to a confined technical architecture consisting of core Hadoop services and libraries for data ingest, processing and analytics. The objective of this paper is to address questions such as: What applications and datasets are suitable for Hadoop? How can a diverse set of frameworks and tools be managed on a multi-tenant Hadoop cluster? How do these tools integrate with existing relational data management systems? How can enterprise security requirements be addressed? What are the performance characteristics of these tools for real-world automotive applications? To address the last question, we use a standard benchmark (TPCx-HS) and two application benchmarks (SQL and machine learning) that operate on a dataset of multiple terabytes and billions of rows.
