Hardening soft information sources

The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases are "soft": they contain inconsistencies and duplication, and lack unique, consistently used object identifiers. Examples include large bibliographic databases harvested from raw scientific papers or databases constructed by merging heterogeneous "hard" databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global: many sources of evidence for a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial nearly linear time algorithm for finding a local optimum.
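
To make the setup concrete, the following is a minimal toy sketch in Python, not the paper's algorithm. It assumes a hypothetical MDL-style objective (a fact_weight charge per distinct hard fact kept, plus a noise_weight penalty for how far each soft record sits from the hard fact chosen to explain it) and a naive quadratic greedy local search, whereas the paper gives a nearly linear time algorithm. The names similarity, cost, and harden, and both weights, are illustrative assumptions.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for a real noise model."""
    return SequenceMatcher(None, a, b).ratio()

def cost(hard: dict, soft: list, fact_weight: float = 1.0,
         noise_weight: float = 3.0) -> float:
    """Hypothetical MDL-style objective (not the paper's exact cost):
    pay fact_weight for each distinct hard fact kept, plus a corruption
    penalty for each soft record, scaled by its distance from the hard
    fact chosen to explain it."""
    return (fact_weight * len(set(hard.values()))
            + noise_weight * sum(1.0 - similarity(s, hard[s]) for s in soft))

def harden(soft: list) -> dict:
    """Greedy local search: each soft record starts out explaining itself;
    merge a pair of hard facts whenever doing so lowers the objective,
    and stop at a local optimum. Quadratic and naive; the paper gives a
    nearly linear time algorithm for this style of search."""
    hard = {s: s for s in soft}  # soft record -> hard fact explaining it
    improved = True
    while improved:
        improved = False
        facts = sorted(set(hard.values()))
        for i, a in enumerate(facts):
            for b in facts[i + 1:]:
                # Tentatively let a explain everything currently mapped to b.
                merged = {s: (a if h == b else h) for s, h in hard.items()}
                if cost(merged, soft) < cost(hard, soft):
                    hard, improved = merged, True
                    break
            if improved:
                break
    return hard

soft_records = ["J. Smith, Data Mining, 1999",
                "J Smith. Data Mining (1999)",
                "A. Jones, Databases, 2001"]
print(harden(soft_records))  # the two Smith variants map to one hard fact

On this toy input, the two variant citations of the same paper collapse to a single hard fact while the unrelated record survives. Weighing corruption penalties more heavily than the savings from dropping a fact is what keeps dissimilar records from being merged; hardening is global in the sense that every soft record explained by a hard fact contributes evidence to the objective.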