Extraction by Example: Induction of Structural Rules for the Analysis of Molecular Sequence Data from Heterogeneous Sources

Biological research requires information from multiple data sources that use a variety of database-specific formats. Manual gathering of information is time-consuming and error-prone, making automated data aggregation a compelling option for large studies. We describe a method for extracting information from diverse sources that involves structural rules specified by example. We developed a system for aggregation of biological knowledge (ABK) and used it to conduct an epidemiological study of dengue virus (DENV) sequences. Additional information on geographical origin and isolation date is critical for understanding evolutionary relationships, but these data are inconsistently structured in database entries. Using three public databases, we found that structural rules can be used successfully even when applied to inconsistently structured data distributed across multiple fields. High reusability, combined with the ability to integrate analysis tools, makes this method suitable for a wide variety of large-scale studies involving viral sequences.
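As a rough illustration of what "structural rules specified by example" could look like, the sketch below induces a tag path from one annotated record and reuses it on another record with the same layout. It is not the ABK implementation, which the abstract does not detail. The record layout, the accession values, and the function names `induce_rule` and `apply_rule` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; real database entries are laid out differently.
EXAMPLE_RECORD = (
    "<entry><accession>X1</accession>"
    "<source><country>Brazil</country><year>1997</year></source></entry>"
)
OTHER_RECORD = (
    "<entry><accession>X2</accession>"
    "<source><country>Thailand</country><year>2001</year></source></entry>"
)


def induce_rule(record_xml, example_value):
    """Return the tag path to the first element whose text equals example_value."""
    root = ET.fromstring(record_xml)

    def walk(node, path):
        # Depth-first search for an element holding the example value.
        if (node.text or "").strip() == example_value:
            return path
        for child in node:
            found = walk(child, path + [child.tag])
            if found is not None:
                return found
        return None

    path = walk(root, [])
    if path is None:
        raise ValueError("example value not found in record")
    # An empty path means the root element itself matched.
    return "/".join(path) if path else "."


def apply_rule(record_xml, rule):
    """Apply an induced path to another record; return None if the field is absent."""
    node = ET.fromstring(record_xml).find(rule)
    if node is None or node.text is None:
        return None
    return node.text.strip()


# The user marks "Brazil" as the country in one record; the rule is reused elsewhere.
country_rule = induce_rule(EXAMPLE_RECORD, "Brazil")
other_country = apply_rule(OTHER_RECORD, country_rule)
```

A single path of this kind only covers records that share one layout. Handling inconsistently structured data spread across multiple fields, as the abstract describes, would presumably need several rules per attribute with a fallback order.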
