Mining Order-Preserving Submatrices Under Data Uncertainty: A Possible-World Approach

Given a data matrix D, a submatrix S of D is an order-preserving submatrix (OPSM) if there is a permutation of the columns of S under which the entry values of each row in S are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preferences. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work has handled uncertain value intervals under the possible-world semantics. We establish two definitions of significant OPSMs based on the possible-world semantics: (1) expected-support-based and (2) probabilistic-frequentness-based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, and several effective pruning rules are introduced to discard insignificant OPSMs efficiently. These techniques are integrated into our two OPSM mining algorithms, based on prefix-projection and Apriori, respectively. Extensive experiments on real microarray data demonstrate that the OPSMs found by our algorithms are of much higher quality than those found by existing approaches.
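
The quantities named above can be illustrated with a minimal sketch. It is not the optimized dynamic program or either mining algorithm of the paper. It assumes, purely for illustration, that every cell has an independent discrete distribution (a dict mapping value to probability) rather than a value interval, and that rows are independent. All function names (`supports`, `row_support_prob`, `expected_support`, `frequentness_prob`) are hypothetical.

```python
def supports(row, perm):
    """Deterministic case: True if the row's values are strictly
    increasing when its columns are visited in the order `perm`."""
    return all(row[a] < row[b] for a, b in zip(perm, perm[1:]))


def row_support_prob(cells, perm):
    """Probability that an uncertain row supports `perm`.

    Assumption: cells[c] is a dict {value: probability} for column c,
    and cells are independent. `prev[v]` holds the probability that the
    columns processed so far are strictly increasing and the last one
    takes value v.
    """
    prev = dict(cells[perm[0]])
    for col in perm[1:]:
        cur = {}
        for v, pv in cells[col].items():
            # Mass of all prefixes whose last value is strictly below v.
            below = sum(p for u, p in prev.items() if u < v)
            if below > 0.0:
                cur[v] = pv * below
        prev = cur
    return sum(prev.values())


def expected_support(probs):
    """Expected number of supporting rows (linearity of expectation)."""
    return sum(probs)


def frequentness_prob(probs, min_sup):
    """Pr[number of supporting rows >= min_sup], assuming independent rows.

    dist[k] is the probability that exactly k rows support the pattern,
    with k capped at min_sup (the last state is absorbing).
    """
    dist = [1.0] + [0.0] * min_sup
    for p in probs:
        for k in range(min_sup, -1, -1):
            stay = dist[k] if k == min_sup else dist[k] * (1.0 - p)
            move = dist[k - 1] * p if k > 0 else 0.0
            dist[k] = stay + move
    return dist[min_sup]


if __name__ == "__main__":
    # Two uncertain rows over three columns; values are made up.
    rows = [
        [{1: 0.5, 3: 0.5}, {2: 1.0}, {3: 0.5, 4: 0.5}],
        [{1: 1.0}, {0: 0.5, 2: 0.5}, {5: 1.0}],
    ]
    perm = (0, 1, 2)
    probs = [row_support_prob(r, perm) for r in rows]
    print(probs, expected_support(probs), frequentness_prob(probs, 1))
```

Under these assumptions, a pattern would be called significant by the first definition when `expected_support` reaches a minimum-support threshold, and by the second when `frequentness_prob` reaches a user-chosen probability threshold; both thresholds are parameters of this sketch.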
