Mining Frequent Itemsets in Uncertain Datasets

Real-world data are often noisy or uncertain. However, traditional data mining algorithms either ignore this uncertainty or account for it only in a very limited way. In this paper, we define a relatively generic model of data uncertainty in which each data item carries a “tag” that specifies the degree of confidence in its value. This model is more realistic in the many cases where data items are derived from other evidence or from more basic data. Simple examples are face recognition and fingerprint identification, where the raw data itself can influence the degree of confidence in the identification. As an example problem, we study frequent itemset mining over such uncertain data. With uncertain data, frequent itemsets cannot be identified perfectly: there will be false positives (itemsets estimated to be frequent that are not) and false negatives (frequent itemsets estimated not to be frequent). We consider several intuitive approaches and propose a new scheme that significantly reduces the number of false positives and false negatives.
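
To make the data model concrete, the following is a minimal sketch, not the scheme proposed in this paper. It assumes each transaction can be represented as a mapping from item to a confidence tag in [0, 1], and it shows two plausible "intuitive" ways to estimate support: discarding low-confidence items before counting, and summing the product of confidences under an assumed independence between items. The dataset, the function names, and the 0.5 cutoff are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import prod

# Hypothetical uncertain dataset: each transaction maps an item to the
# confidence (in [0, 1]) that the item is really present.
transactions = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.6, "b": 0.4, "c": 1.0},
    {"a": 0.3, "c": 0.7},
    {"b": 0.9, "c": 0.5},
]

def thresholded_support(itemset, transactions, cutoff=0.5):
    """Count transactions in which every item of the itemset has
    confidence >= cutoff (the cutoff value is an assumption)."""
    return sum(
        all(t.get(item, 0.0) >= cutoff for item in itemset)
        for t in transactions
    )

def expected_support(itemset, transactions):
    """Sum, over transactions, of the product of item confidences.
    This treats the confidences as independent probabilities, which is
    an assumption of this sketch."""
    return sum(
        prod(t.get(item, 0.0) for item in itemset)
        for t in transactions
    )

def frequent_itemsets(transactions, min_support, support_fn, max_size=2):
    """Brute-force enumeration of itemsets up to max_size whose estimated
    support reaches min_support. Suitable only for tiny examples."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, max_size + 1):
        for itemset in combinations(items, k):
            support = support_fn(itemset, transactions)
            if support >= min_support:
                result[itemset] = support
    return result

if __name__ == "__main__":
    print(frequent_itemsets(transactions, 2, thresholded_support))
    print(frequent_itemsets(transactions, 1.5, expected_support))
```

Because the two estimators weigh the confidence tags differently, they can disagree about which itemsets count as frequent. Disagreements of this kind are where false positives and false negatives come from.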