Improving Integration Effectiveness of ID Mapping Based Biological Record Linkage

Traditionally, biological objects such as genes, proteins, and pathways are represented by a convenient identifier, or ID, which is then used to cross-reference, link, and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR, and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor-quality data, which can inadvertently lead to erroneous conclusions. In this paper, we discuss this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc, restriction-free ID mapping of arbitrary types using online resources. We also propose two declarative statements for ID conversion and for data integration based on on-the-fly ID mapping.
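The effect of an incomplete mapping on record linkage can be illustrated with a minimal sketch. It is not one of the algorithms proposed in the paper; the identifiers (`G1`, `P1`, and so on), the field names, and the `link_records` helper are all hypothetical and chosen only for illustration. The sketch assumes a mapping table that lacks one true gene-to-protein pair, so the corresponding record is silently dropped from the integrated result.

```python
# Toy illustration only: every identifier and field name here is invented.
# Assumption: the true mapping G3 -> P3 exists but is missing from id_map,
# which models a false negative in an ID mapping resource.
id_map = {"G1": "P1", "G2": "P2"}

expression = [
    {"gene_id": "G1", "level": 2.4},
    {"gene_id": "G2", "level": 0.7},
    {"gene_id": "G3", "level": 5.1},
]

proteins = {
    "P1": {"name": "protein one"},
    "P2": {"name": "protein two"},
    "P3": {"name": "protein three"},
}


def link_records(left, right, mapping):
    """Link left records to right records through an ID mapping table.

    Returns (linked, unlinked); a record is unlinked when its ID has no
    mapping or the mapped ID is absent from the right-hand table.
    """
    linked, unlinked = [], []
    for rec in left:
        target = mapping.get(rec["gene_id"])
        if target is not None and target in right:
            linked.append({**rec, "protein_id": target, **right[target]})
        else:
            unlinked.append(rec)
    return linked, unlinked


linked, unlinked = link_records(expression, proteins, id_map)
print(len(linked), "linked;", len(unlinked), "missed")
```

In this toy setting the record for `G3` has a valid partner in the protein table, yet it never reaches the integrated output because the mapping table does not contain the pair. Any downstream analysis would proceed without it and without any warning.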
