Mining probabilistically frequent sequential patterns in uncertain databases

Data uncertainty is inherent in many real-world applications such as environmental surveillance and mobile tracking. As a result, mining sequential patterns from inaccurate data, such as sensor readings and GPS trajectories, is important for discovering hidden knowledge in such applications. Previous work uses expected support to measure pattern frequentness. This measure has inherent weaknesses with respect to the underlying probability model and is therefore ineffective for mining high-quality sequential patterns from uncertain sequence databases. In this paper, we propose to measure pattern frequentness based on the possible world semantics. We establish two uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data, and formulate the problem of mining probabilistically frequent sequential patterns (or p-FSPs) from data that conform to our models. Based on the prefix-projection strategy of the well-known PrefixSpan algorithm, we develop two new algorithms, collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan effectively avoids the problem of "possible world explosion" and, when combined with our three pruning techniques and one validating technique, achieves good performance. The efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and synthetic datasets.
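To illustrate the difference between expected support and a possible-world measure of frequentness, here is a minimal Python sketch. It is not U-PrefixSpan itself. It assumes that each sequence contains a given pattern independently with a known probability, and it treats a pattern as probabilistically frequent when the probability that its support reaches `min_sup` is at least `min_prob`. The function names, the thresholds, and the input probabilities are all hypothetical and chosen only for illustration. The support distribution is built with a simple dynamic program, so no possible world is enumerated.

```python
from typing import List, Sequence


def support_pmf(probs: Sequence[float]) -> List[float]:
    """Return pmf where pmf[k] = P(support == k).

    Assumes sequence i contains the pattern independently with
    probability probs[i] (an illustrative assumption).
    """
    pmf = [1.0]  # with zero sequences, support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # sequence does not contain the pattern
            nxt[k + 1] += mass * p       # sequence contains the pattern
        pmf = nxt
    return pmf


def expected_support(probs: Sequence[float]) -> float:
    """Expected support is the sum of the containment probabilities."""
    return sum(probs)


def frequentness_probability(probs: Sequence[float], min_sup: int) -> float:
    """P(support >= min_sup) over all possible worlds."""
    return sum(support_pmf(probs)[min_sup:])


def is_probabilistically_frequent(
    probs: Sequence[float], min_sup: int, min_prob: float
) -> bool:
    """Hypothetical p-FSP test: frequent with probability at least min_prob."""
    return frequentness_probability(probs, min_sup) >= min_prob


# Two hypothetical patterns with the same expected support (2.0).
pattern_a = [0.5, 0.5, 0.5, 0.5]
pattern_b = [1.0, 1.0, 0.0, 0.0]
```

Both hypothetical patterns have the same expected support, so a threshold on expected support cannot tell them apart. Their support distributions differ, though: with `min_sup = 2`, `pattern_b` reaches the threshold in every possible world, while `pattern_a` reaches it only in some of them.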
