Provenance in Scientific Workflow Systems

The volume of natural-language information held in electronic form is increasing exponentially, and the users of information management systems are becoming increasingly multilingual. Together, these trends require information management systems to process information in multiple natural languages seamlessly. Database systems, the backbone of information management, should support this requirement effectively and efficiently. Earlier research in this area proposed multilingual operators [7, 8] for relational database systems and discussed their implementation using existing database features. In this paper, we focus on the SemEQUAL operator [8], which implements a multilingual semantic matching predicate using WordNet [12]. We explore the implementation of SemEQUAL using OrdPath [10], a positional representation for the nodes of a hierarchy that has been used successfully to support XML documents in relational systems. We propose using OrdPath to represent position within the WordNet hierarchy, leveraging its ability to compute transitive closures efficiently. We show theoretically that an implementation using OrdPath will outperform the implementations proposed previously. Our initial experimental results confirm this analysis and show that the OrdPath implementation performs significantly better. Further, since our technique is not tied to linguistic hierarchies, the same approach may benefit other applications that use alternative hierarchical ontologies.
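To make the idea of a positional label concrete, the following is a minimal sketch and not the implementation described in the paper. It assumes the hierarchy is a tree (real WordNet allows a term to have several parents), uses a made-up five-term taxonomy, and assumes the semantic match reduces to asking whether one term lies in the subtree (closure) of another. The names `assign_labels`, `is_descendant_or_self`, `sem_equal` and `TOY_TREE` are hypothetical. Labels follow the OrdPath convention of odd ordinals per level, but the insertion ("careting") mechanism and the compressed binary encoding are omitted.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, ...]

# Hypothetical toy taxonomy (parent -> children); not real WordNet data.
TOY_TREE: Dict[str, List[str]] = {
    "entity": ["animal", "artifact"],
    "animal": ["dog", "cat"],
    "artifact": ["book"],
}


def assign_labels(tree: Dict[str, List[str]], root: str) -> Dict[str, Label]:
    """Label each node with its parent's label plus an odd ordinal (1, 3, 5, ...)."""
    labels: Dict[str, Label] = {root: (1,)}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(tree.get(node, [])):
            labels[child] = labels[node] + (2 * i + 1,)
            stack.append(child)
    return labels


def is_descendant_or_self(label: Label, ancestor: Label) -> bool:
    """A node is in the ancestor's subtree exactly when the ancestor's label is a prefix."""
    return label[: len(ancestor)] == ancestor


def sem_equal(labels: Dict[str, Label], term: str, concept: str) -> bool:
    """Assumed matching rule: `term` matches `concept` if it lies in the concept's closure."""
    return is_descendant_or_self(labels[term], labels[concept])


labels = assign_labels(TOY_TREE, "entity")
```

With these labels, the closure test is a single prefix comparison on two labels rather than a recursive walk of the hierarchy. In a relational setting the same prefix test can be expressed as a range condition over a sorted label column, which is what makes this style of labelling attractive for closure queries.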

[1]  Carl Vogel,et al.  The Topology of WordNet: Some Metrics , 2004 .

[2]  Cláudio T. Silva,et al.  Using Provenance to Streamline Data Exploration through Visualization (SCI Institute Technical Report, No. UUSCI-2006-016) , 2006 .

[3]  Peter Buneman,et al.  Provenance in databases , 2009, SIGMOD '07.

[4]  Christoph Koch,et al.  Processing queries on tree-structured data efficiently , 2006, PODS.

[5]  Bertram Ludäscher,et al.  Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data , 2006, DILS.

[6]  Matthew R. Pocock,et al.  Taverna: a tool for the composition and enactment of bioinformatics workflows , 2004, Bioinform..

[7]  Bertram Ludäscher,et al.  Project Histories: Managing Data Provenance Across Collection-Oriented Scientific Workflow Runs , 2007, DILS.

[8]  James Frew,et al.  Lineage retrieval for scientific data processing: a survey , 2005, CSUR.

[9]  Patrick E. O'Neil,et al.  ORDPATHs: insert-friendly XML node labels , 2004, SIGMOD '04.

[10]  Carmem S. Hara,et al.  Querying and Managing Provenance through User Views in Scientific Workflows , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[11]  Bertram Ludäscher,et al.  A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows , 2006, IPAW.

[12]  J. Håstad  Clique is hard to approximate within n^(1−ε) , 1999 .

[13]  Edward A. Lee,et al.  Dataflow process networks , 1995, Proc. IEEE.

[14]  Cláudio T. Silva,et al.  Managing Rapidly-Evolving Scientific Workflows , 2006, IPAW.

[15]  Susan B. Davidson,et al.  Zoom*UserViews: Querying Relevant Provenance in Workflow Systems , 2007, VLDB.

[16]  Bertram Ludäscher,et al.  CONCURRENCY AND COMPUTATION : PRACTICE AND EXPERIENCE Concurrency Computat , 2008 .

[17]  Michael Stonebraker,et al.  Readings in Database Systems: Fourth Edition , 2005 .

[18]  Catriel Beeri,et al.  Monitoring Business Processes with Queries , 2007, VLDB.

[19]  Ian Foster,et al.  Report on the International Provenance and Annotation Workshop , 2006 .

[20]  Yogesh L. Simmhan,et al.  A survey of data provenance in e-science , 2005, SIGMOD Rec.

[21]  Jayant R. Haritsa,et al.  LexEQUAL: multilexical matching operator in SQL , 2004, SIGMOD '04.

[22]  Bertram Ludäscher,et al.  Actor-Oriented Design of Scientific Workflows , 2005, ER.

[23]  Yong Zhao,et al.  Chimera: a virtual data system for representing, querying, and automating data derivation , 2002, Proceedings 14th International Conference on Scientific and Statistical Database Management.

[24]  Jayant R. Haritsa,et al.  SemEQUAL: Multilingual Semantic Matching in Relational Systems , 2005, DASFAA.

[25]  Eric Brewer,et al.  Combining Systems and Databases: A Search Engine Retrospective , 2004 .