Monte Carlo query processing of uncertain multidimensional array data

Array database systems are designed for scientific and engineering applications, in which the value of a cell is often imprecise and uncertain. There are at least two reasons why such uncertain data usually requires a Monte Carlo query processing algorithm. First, a probabilistic graphical model must often be used to model correlation, and inference over such a model for the operations in our database requires a Monte Carlo algorithm. Second, the mathematical operators required by science and engineering domains are much more complex than those of SQL. State-of-the-art query processing uses Monte Carlo approximation. We give an example of using Markov Random Fields, combined with an array's chunking or tiling mechanism, to model correlated data. We then propose solutions for two of the most challenging problems in this framework: the expensive array join operation, and the determination and optimization of stopping conditions for Monte Carlo query processing. Finally, we perform an extensive empirical study on a real-world application.
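To make the notion of a stopping condition concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a simple rule: keep sampling possible worlds until the normal-approximation confidence interval around the query result is narrower than a relative error bound. The names `monte_carlo_query`, `make_sampler`, and `array_mean` are hypothetical, and the cells are sampled as independent Gaussians purely for brevity, whereas the abstract describes correlated cells modeled with Markov Random Fields.

```python
import math
import random


def monte_carlo_query(sample_world, query, rel_err=0.01, z=1.96,
                      min_samples=30, max_samples=100_000):
    """Estimate E[query(world)] by sampling possible worlds.

    Assumed stopping rule (illustrative only): stop once the CLT-based
    confidence-interval half-width is at most rel_err * |running mean|.
    """
    n, mean, m2 = 0, 0.0, 0.0          # Welford running mean and variance
    half_width = math.inf
    while n < max_samples:
        x = query(sample_world())
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            # standard error = sqrt(sample variance / n)
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width <= rel_err * abs(mean):
                break
    return mean, half_width, n


def make_sampler(means, sigma, rng):
    """Hypothetical uncertain array: each cell ~ Normal(mean, sigma).

    Cells are independent here; this is a simplifying assumption.
    """
    def sample_world():
        return [[rng.gauss(mu, sigma) for mu in row] for row in means]
    return sample_world


def array_mean(world):
    """Example query: average over all cells of one sampled array."""
    cells = [v for row in world for v in row]
    return sum(cells) / len(cells)


rng = random.Random(0)
means = [[10.0, 12.0], [11.0, 13.0]]
estimate, half_width, used = monte_carlo_query(
    make_sampler(means, 1.0, rng), array_mean)
print(f"estimate={estimate:.3f} +/- {half_width:.3f} after {used} samples")
```

Tightening `rel_err` increases the number of sampled worlds roughly quadratically, which is why choosing and optimizing the stopping condition has a direct effect on query cost.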
