Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases

Data uncertainty is inherent in many real-world applications, such as environmental surveillance and mobile tracking. Mining sequential patterns from inaccurate data, such as sensor readings and GPS trajectories, is important for discovering hidden knowledge in these applications. In this paper, we propose to measure pattern frequentness based on possible world semantics. We establish two uncertain sequence data models abstracted from many real-life applications, and formulate the problem of mining probabilistically frequent sequential patterns (p-FSPs) from data that conform to our models. Because the number of possible worlds is extremely large, mining over them directly is prohibitively expensive. Inspired by the well-known PrefixSpan algorithm, we develop two new algorithms, collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan avoids the problem of “possible worlds explosion” and achieves even better performance when combined with our four pruning and validating methods. We also propose a fast validating method to further speed up U-PrefixSpan. The efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and synthetic datasets.
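To make the notion of probabilistic frequentness concrete, the following is a minimal sketch, not the U-PrefixSpan algorithm itself. It assumes (this is not stated in the abstract) that each uncertain sequence contains a given pattern independently with a known probability, and that a pattern counts as probabilistically frequent when the probability that its support reaches `min_sup` is at least `min_prob`. The function names and both thresholds are hypothetical. Under these assumptions the support is a sum of independent indicators, so its distribution can be computed by dynamic programming without enumerating possible worlds.

```python
from typing import List, Sequence


def support_pmf(contain_probs: Sequence[float]) -> List[float]:
    """Return pmf where pmf[k] = P(support == k).

    contain_probs[i] is the (assumed known, assumed independent)
    probability that the i-th uncertain sequence contains the pattern.
    """
    pmf = [1.0]  # with zero sequences, the support is 0 with certainty
    for p in contain_probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this sequence misses the pattern
            nxt[k + 1] += mass * p      # this sequence contains the pattern
        pmf = nxt
    return pmf


def is_probabilistically_frequent(
    contain_probs: Sequence[float], min_sup: int, min_prob: float
) -> bool:
    """Check whether P(support >= min_sup) >= min_prob (hypothetical thresholds)."""
    pmf = support_pmf(contain_probs)
    return sum(pmf[min_sup:]) >= min_prob


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.1]  # illustrative values, not taken from the paper
    print(support_pmf(probs))
    print(is_probabilistically_frequent(probs, min_sup=2, min_prob=0.7))
```

The dynamic program takes time quadratic in the number of sequences, whereas enumerating possible worlds would take time exponential in it.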
