Interoperable framework for improving data quality using semantic approach: use case on biodiversity

The Internet is growing exponentially, and its rising number of users, together with the demand they generate, has triggered an unprecedented wave of digital data of many kinds on the Web. Much of this data is relevant and could be turned into actionable insights, but its sheer volume and unstructured format make it hard to handle, so it often fails to meet the requirements of professionals and end users. For the biodiversity domain, this paper proposes a conceptual data science approach that extracts and structures data automatically, making biodiversity-rich data and multiple-record documents usable while saving time and effort. The major drawback of manually extracting and storing biodiversity data is that it introduces errors (such as spelling mistakes and skipped data fields) that are difficult to correct during processing, so the resulting data cannot meet research demands. An automated data science approach can address these drawbacks and is intended to be fast, flexible, reliable, and accurate. Its main maintenance requirement is regular monitoring and analysis of the Hypertext Markup Language (HTML) structure, documents, and links of the target sources. Because such a large dataset contains many errors and noisy characters, a data cleaning algorithm is used to remove them and prepare the data for further systematic research. The wide variety of data formats makes interoperability a daunting task, particularly since some datasets do not follow their own schema. Semantic interoperability helps meet this challenge by exchanging data through web services between independent, loosely coupled systems. This paper presents an overview of semantic interoperability and case studies of projects that implemented it for biodiversity data sharing.
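
The extraction and cleaning steps described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the target source publishes records as a plain HTML table, and the names `TableRowParser`, `clean_field`, and `extract_records`, the cleaning rules, and the sample input are all hypothetical choices made for illustration. It uses only the Python standard library.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def clean_field(text):
    """Assumed cleaning rules: normalize Unicode, drop noise, collapse spaces."""
    text = unicodedata.normalize("NFKC", text)
    # Remove control and format characters (e.g. zero-width spaces).
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\t\n "
    )
    # Keep only word characters, whitespace, and a few punctuation marks.
    text = re.sub(r"[^\w\s.,()'-]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_records(html):
    """Return (records, rejected): complete rows as dicts, incomplete rows as lists."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return [], []
    header = [clean_field(h).lower() for h in parser.rows[0]]
    records, rejected = [], []
    for row in parser.rows[1:]:
        cells = [clean_field(c) for c in row]
        # A skipped or empty field marks the row for manual review.
        if len(cells) != len(header) or not all(cells):
            rejected.append(cells)
        else:
            records.append(dict(zip(header, cells)))
    return records, rejected


# Invented sample input, for illustration only.
SAMPLE = """
<table>
  <tr><th>Species</th><th>Locality</th></tr>
  <tr><td>  Shorea\u00a0robusta ## </td><td>Dehradun\u200b</td></tr>
  <tr><td>Pinus roxburghii</td><td></td></tr>
</table>
"""

if __name__ == "__main__":
    records, rejected = extract_records(SAMPLE)
    print(records)
    print(rejected)
```

Rows with a missing field are set aside rather than silently dropped, which reflects the abstract's concern about skipped data fields. A real deployment would need source-specific parsing rules and would have to be revisited whenever the HTML structure of a target source changes.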
