Instance based Matching using Regular Expression

Instance-based matching is the process of comparing data values from heterogeneous data sources to determine the correspondences between schema elements. It is a useful alternative when schema information (element names, descriptions, constraints) is unavailable or insufficient to determine the match between schema elements. Instance-based matching is a nontrivial problem that arises in many application areas, such as data integration, data cleaning, query mediation, and data warehousing. Many instance-based solutions to the schema matching problem have been proposed, and most of them rely on similarity metrics. In this paper, we present a fully automatic approach to instance-based matching that uses regular expressions to identify correspondences between attributes, which are one kind of schema element. Several experiments on real-world data sets were conducted to evaluate the performance of the proposed approach. The results show that the proposed approach achieves better accuracy than previous approaches based on similarity metrics.
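The abstract does not spell out the matching procedure, so the following Python sketch only illustrates the general idea of pairing attributes by the regular expression their instance values satisfy. The pattern library `PATTERNS`, the 0.8 threshold, and the helper names `classify`, `column_signature`, and `match_attributes` are all assumptions made for illustration and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical pattern library (an assumption; the paper's actual
# expressions are not given in the abstract). Order matters: the first
# pattern that fully matches a value wins.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\- ]{6,}\d"),
    "integer": re.compile(r"-?\d+"),
}


def classify(value):
    """Return the name of the first pattern that matches the whole value."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(value.strip()):
            return name
    return None


def column_signature(values, threshold=0.8):
    """Return the dominant pattern of a column, or None if no pattern
    covers at least `threshold` of its values (threshold is assumed)."""
    counts = Counter(classify(v) for v in values)
    counts.pop(None, None)  # ignore values that matched no pattern
    if not counts:
        return None
    name, hits = counts.most_common(1)[0]
    return name if hits / len(values) >= threshold else None


def match_attributes(source, target):
    """Pair source and target attributes whose instances share a pattern.

    `source` and `target` map attribute names to lists of string values.
    """
    target_sigs = {attr: column_signature(vals) for attr, vals in target.items()}
    matches = []
    for s_attr, s_vals in source.items():
        sig = column_signature(s_vals)
        if sig is None:
            continue
        for t_attr, t_sig in target_sigs.items():
            if t_sig == sig:
                matches.append((s_attr, t_attr, sig))
    return matches


if __name__ == "__main__":
    source = {"dob": ["1990-01-02", "1985-12-30"],
              "contact": ["a@x.org", "b@y.com"]}
    target = {"mail": ["c@z.net", "d@w.io"],
              "birth_date": ["2001-05-06", "1999-09-09"],
              "qty": ["3", "17"]}
    print(match_attributes(source, target))
```

In this sketch the attribute names play no role: `dob` is paired with `birth_date` only because the values in both columns satisfy the same expression, which is the situation the abstract describes when schema information is unavailable.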
