Paper information - Text joins for data cleansing and integration in an RDBMS

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
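To make the matching step concrete, here is a minimal Python sketch of an exact text join scored by cosine similarity. It is an illustration under stated assumptions, not the strategy proposed in the paper. The abstract does not say how strings are tokenized or weighted, so the character 3-grams, the tf-idf weights, and the single idf computed over both input lists are all choices made for this example. The names `qgrams`, `tfidf_vectors`, and `text_join` are hypothetical. The sketch runs in memory and compares every pair, which is the expensive exact computation the abstract refers to. It does not implement the sampling-based approximation or any in-database execution.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Split a lowercased string into overlapping character q-grams."""
    s = text.lower()
    if len(s) < q:
        return [s]
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def tfidf_vectors(strings, q=3):
    """Return one unit-length tf-idf vector (token -> weight) per string.

    Assumption: idf is log(1 + N / df), computed over all given strings.
    """
    token_lists = [qgrams(s, q) for s in strings]
    n = len(strings)
    # Document frequency: number of strings containing each token.
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        weights = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def text_join(left, right, threshold=0.5, q=3):
    """Exact text join: all (left, right) pairs with cosine similarity >= threshold.

    Vectors are unit length, so the dot product is the cosine similarity.
    This nested loop is quadratic in the input sizes.
    """
    vecs = tfidf_vectors(list(left) + list(right), q)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    matches = []
    for i, a in enumerate(left_vecs):
        for j, b in enumerate(right_vecs):
            sim = sum(w * b[t] for t, w in a.items() if t in b)
            if sim >= threshold:
                matches.append((left[i], right[j], sim))
    return matches
```

Calling `text_join(names_a, names_b, threshold=0.5)` on two lists of strings returns the pairs whose similarity reaches the threshold, together with their scores. The threshold value is arbitrary here and would need tuning on real data.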
