Real-time intelligent big data processing: technology, platform, and applications

Humans have long explored the physical world through information technologies. Only recently, with the rapid development of those technologies and the steady accumulation of data, has it become possible to learn about the unknown world with data-driven methods. Because the value of data depends on its timeliness, awareness of the importance of real-time data is growing. Data processing technologies fall into two categories, batch big data processing and stream processing, which have not been well integrated. We therefore propose an incremental processing technology named Stream Cube that handles both big data and stream data. We also implement a real-time intelligent data processing system built on real-time acquisition, real-time processing, real-time analysis, and real-time decision-making. The system is equipped with a batch big data platform, data analysis tools, and machine learning models. Based on our applications and analysis, the real-time intelligent data processing system is a crucial solution to problems facing national society and the economy.
