Instance-based attribute identification in database integration

Abstract. Most research on attribute identification in database integration has focused on integrating attributes using schema information and summary information derived from the attribute values. No research has attempted to fully explore the use of attribute values to perform attribute identification. We propose an attribute identification method that employs schema and summary instance information as well as properties of attributes derived from their instances. Unlike other attribute identification methods, which match only single attributes, our method matches attribute groups for integration. Because our method fully explores data instances, it can identify corresponding attributes to be integrated even when schema information is misleading. Three experiments were performed to validate the method. In the first, the heuristic rules derived for attribute classification were evaluated on 119 attributes from nine public-domain data sets. The second was a controlled experiment that tested the robustness of the method by introducing erroneous data. The third evaluated the method on five data sets extracted from online music stores. The results demonstrated the viability of the proposed method.
