Indexing Uncertain Categorical Data

Uncertainty in categorical data is commonplace in many applications, including data cleaning, database integration, and biological annotation. In such domains, the correct value of an attribute is often unknown, but may be selected from a reasonable number of alternatives. Current database management systems do not provide a convenient means for representing or manipulating this type of uncertainty. In this paper we extend traditional systems to explicitly handle uncertainty in data values. We propose two index structures for efficiently searching uncertain categorical data, one based on the R-tree and another based on an inverted index structure. Using these structures, we provide a detailed description of the probabilistic equality queries they support. Experimental results using real and synthetic datasets demonstrate how these index structures can effectively improve the performance of queries through the use of internal probabilistic information.

[1]  Patrick Bosc,et al.  About projection-selection-join queries addressed to possibilistic relational databases , 2005, IEEE Transactions on Fuzzy Systems.

[2]  Shourya Roy,et al.  A hierarchical monothetic document clustering algorithm for summarization and browsing search results , 2004, WWW '04.

[3]  Patrick Bosc,et al.  Indexing principles for a fuzzy data base , 1989, Inf. Syst..

[4]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[5]  José Galindo,et al.  Fuzzy Databases: Modeling, Design, and Implementation , 2006 .

[6]  T. S. Jayram,et al.  OLAP over uncertain and imprecise data , 2007, The VLDB Journal.

[7]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[8]  Mohand Boughanem,et al.  Management of Uncertainty and Imprecision in Multimedia Information Systems: Introducing this Special Issue , 2003, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[9]  Hidetomo Ichihashi,et al.  Fuzzy clustering for categorical multivariate data , 2001, Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569).

[10]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[11]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[12]  Christos Faloutsos,et al.  Signature Files , 1992, Information Retrieval: Data Structures & Algorithms.

[13]  Jennifer Widom,et al.  Trio: A System for Integrated Management of Data, Accuracy, and Lineage , 2004, CIDR.

[14]  Lotfi A. Zadeh,et al.  Fuzzy sets and systems , 1990 .

[15]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[16]  Hector Garcia-Molina,et al.  The Management of Probabilistic Data , 1992, IEEE Trans. Knowl. Data Eng..

[17]  Sven Helmer,et al.  Evaluating different approaches for indexing fuzzy sets , 2003, Fuzzy Sets Syst..

[18]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[19]  Walid G. Aref,et al.  Supporting top-kjoin queries in relational databases , 2004, The VLDB Journal.

[20]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS '01.

[21]  Nikos Mamoulis,et al.  Similarity search in sets and categorical data using the signature tree , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[22]  Jeffrey Scott Vitter,et al.  Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data , 2004, VLDB.

[23]  Sven Helmer,et al.  Index structures for efficiently accessing fuzzy data including cost models and measurements , 1999, Fuzzy Sets Syst..

[24]  Nikos Mamoulis,et al.  Efficient processing of joins on set-valued attributes , 2003, SIGMOD '03.

[25]  Dan Suciu,et al.  Towards correcting input data errors probabilistically using integrity constraints , 2006, MobiDE '06.