Schema Matching and Data Integration with Consistent Naming on Protein Crystallization Screens

The data representations and naming conventions used in commercial screen files differ between companies, which makes the automated analysis of crystallization experiments difficult and time-consuming. To reduce the human effort this requires, we present an approach that computationally matches the elements of two schemas using linguistic schema matching methods and then transforms the input screen format into another format with user-defined naming. We tested the approach on a number of commercial screens from different companies; the experiments showed an overall schema matching accuracy of 97 percent, which is significantly better than the other two matchers we tested. Our tool enables mapping a screen file from one format to another format preferred by the expert, using the expert's preferred chemical names.
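
The two steps described above (match schema elements by name, then rewrite a screen record into the target format with preferred chemical names) can be illustrated with a minimal sketch. It is not the matcher evaluated in this work: the string similarity from Python's standard `difflib` stands in for a linguistic similarity measure, and the function names, field names, threshold, and the `preferred_names` dictionary are all hypothetical, chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a field name and join its alphanumeric tokens with spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", name.lower()))


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1] between two normalized names.

    Assumption: difflib's ratio is a stand-in for a linguistic matcher.
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source, target, threshold=0.5):
    """Map each source field to its most similar target field.

    Source fields whose best score falls below the (hypothetical)
    threshold are left unmapped.
    """
    mapping = {}
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            mapping[s] = best
    return mapping


def transform(row, mapping, preferred_names):
    """Rename the fields of one screen record and apply preferred chemical names."""
    out = {}
    for field, value in row.items():
        if field in mapping:
            # Values without a preferred name are passed through unchanged.
            out[mapping[field]] = preferred_names.get(value, value)
    return out


# Hypothetical screen record and target schema.
source_row = {"Well_Number": "A1", "Precipitant-Name": "PEG 4K"}
target_fields = ["well number", "precipitant name"]
preferred_names = {"PEG 4K": "PEG 4000"}

mapping = match_schemas(source_row.keys(), target_fields)
converted = transform(source_row, mapping, preferred_names)
print(converted)
```

In this sketch, fields that fall below the threshold are dropped silently; a practical tool would more likely flag them for review by the expert.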
