A MapReduce-Based Parallel Frequent Pattern Growth Algorithm for Spatiotemporal Association Analysis of Mobile Trajectory Big Data

Frequent pattern mining is an effective approach to spatiotemporal association analysis of mobile trajectory big data in data-driven intelligent transportation systems. Although existing parallel algorithms have been applied successfully to frequent pattern mining of large-scale trajectory data, two major challenges remain: how to overcome Hadoop's inherent weakness in handling massive numbers of small files, which is characteristic of taxi trajectory big data, and how to discover implicit spatiotemporal frequent patterns with MapReduce. To address these challenges, this paper presents a MapReduce-based Parallel Frequent Pattern growth (MR-PFP) algorithm that analyzes the spatiotemporal characteristics of taxi operations from large-scale taxi trajectories, using small-file processing strategies on a Hadoop platform. More specifically, we first implement three methods, namely Hadoop Archives (HAR), CombineFileInputFormat (CFIF), and Sequence Files (SF), to overcome this weakness of Hadoop, and then propose two strategies based on their performance evaluations. Next, we incorporate SF into the Frequent Pattern growth (FP-growth) algorithm and implement the optimized FP-growth algorithm on a MapReduce framework. Finally, we use MR-PFP to analyze the characteristics of taxi operations in both the spatial and temporal dimensions in parallel. The results demonstrate that MR-PFP is superior to the existing Parallel FP-growth (PFP) algorithm in efficiency and scalability.
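
As a rough illustration of the pipeline outlined above, the following minimal sketch first packs many small files into key/value records, loosely mimicking the layout of a Sequence File, and then runs a map/reduce-style count of item supports, which is the first pass of a parallel FP-growth. It is a plain Python simulation and not the authors' MR-PFP implementation: it does not use Hadoop, and the function names, file names, and toy transactions are all hypothetical.

```python
from collections import Counter
from itertools import chain


def pack_small_files(files):
    """Merge many small files into one list of (filename, content) records.

    Assumption: this only imitates the key/value layout of a Sequence File;
    it does not produce Hadoop's binary format.
    """
    return [(name, content) for name, content in sorted(files.items())]


def map_count(record):
    """Map step: emit (item, 1) once per distinct item in each transaction line."""
    _, content = record
    for line in content.splitlines():
        for item in set(line.split()):
            yield item, 1


def reduce_count(pairs):
    """Reduce step: sum the emitted counts per item."""
    totals = Counter()
    for item, n in pairs:
        totals[item] += n
    return totals


def frequent_items(files, min_support):
    """Return the items whose support count reaches min_support."""
    records = pack_small_files(files)
    pairs = chain.from_iterable(map_count(r) for r in records)
    totals = reduce_count(pairs)
    return {item: n for item, n in totals.items() if n >= min_support}


# Hypothetical toy data: each "file" holds a few whitespace-separated transactions.
toy_files = {
    "taxi_001.txt": "zoneA morning\nzoneB morning",
    "taxi_002.txt": "zoneA evening\nzoneA morning",
}
result = frequent_items(toy_files, min_support=2)
print(result)
```

In a full FP-growth, the frequent items found this way would then be used to order each transaction and build the FP-tree (or, in the parallel variant, group-dependent subtrees); that stage is omitted here.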
