A Survey of Uncertain Data Algorithms and Applications

In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such data points are often represented in the form of a probabilistic function, since the corresponding deterministic value is not known. This makes uncertain data harder to mine and manage, since the precise behavior of the underlying data is no longer known. In this paper, we provide a survey of uncertain data mining and management applications. In the field of uncertain data management, we examine traditional methods such as join processing, query processing, selectivity estimation, OLAP queries, and indexing. In the field of uncertain data mining, we examine traditional mining problems such as classification and clustering, as well as a general transform-based technique for mining uncertain data. We also discuss the models for uncertain data, how they can be leveraged in a variety of applications, and different methodologies for processing and mining uncertain data in its various forms.
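As a small illustration of what representing a data point by a probabilistic function can look like in practice, the following Python sketch models each uncertain reading as a one-dimensional Gaussian and answers a probabilistic threshold range query over a handful of readings. The Gaussian model, the class name `UncertainValue`, the function `threshold_range_query`, and the sample numbers are all assumptions made for this example; they are not taken from the survey, which covers many other models and query types.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainValue:
    """A 1-D uncertain reading, modeled as a Gaussian (an assumption of this sketch)."""
    mean: float
    std: float

    def cdf(self, x: float) -> float:
        # Gaussian cumulative distribution function via the error function.
        return 0.5 * (1.0 + math.erf((x - self.mean) / (self.std * math.sqrt(2.0))))

    def prob_in_range(self, lo: float, hi: float) -> float:
        # Probability that the unknown true value lies in [lo, hi].
        return self.cdf(hi) - self.cdf(lo)


def threshold_range_query(points, lo, hi, tau):
    """Return the points whose probability of lying in [lo, hi] is at least tau."""
    return [p for p in points if p.prob_in_range(lo, hi) >= tau]


# Hypothetical sensor readings, each with its own measurement error.
readings = [
    UncertainValue(mean=20.0, std=1.0),
    UncertainValue(mean=24.5, std=2.0),
    UncertainValue(mean=30.0, std=0.5),
]

hits = threshold_range_query(readings, lo=18.0, hi=25.0, tau=0.5)
for p in hits:
    print(p, round(p.prob_in_range(18.0, 25.0), 3))
```

A deterministic range query would return a plain yes or no for each reading. Here each reading instead contributes a probability, and the threshold `tau` decides how much evidence is required before a reading is reported.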
