Provenance and Data Synchronization

Replication increases the availability of data in mobile and distributed systems. For example, if we copy calendar data from a web service onto a mobile device, the calendar can be accessed even when the network cannot. In peer-based data sharing systems, maintaining a copy of the shared data on a local node enables query answering when remote peers are offline, guarantees privacy, and improves performance. But along with these advantages, replication brings complications: whenever one replica is updated, the others also need to be refreshed to keep the whole system consistent. Therefore, in systems built on replication, synchronization mechanisms are critical.

In simple applications, the replicas are just that: carbon copies of each other. But often the copied data needs to be transformed in different ways on each replica. For example, web services and mobile devices represent calendars in different formats (iCal vs. Palm Datebook). Likewise, in data sharing systems for scientific data, the peers usually have heterogeneous schemas. In these more complicated systems, the replicas behave like views, and so mechanisms for updating and maintaining views are also important.

The mapping between sources and views defined by a query is not generally one-to-one. This loss of information is what makes view update and view maintenance difficult. It has often been observed that provenance (i.e., metadata that tracks the origins of values as they flow through a query) could be used to cope with this loss of information and help with these problems [5, 6, 4, 24], but only a few existing systems (e.g., AutoMed [12]) use provenance in this way, and only for limited classes of views.

This article presents a pair of case studies illustrating how provenance can be incorporated into systems for handling replicated data. The first describes how provenance is used in lenses for ordered data [2].
Lenses define updatable views, which are used to handle heterogeneous replicas in the Harmony synchronization framework [23, 13]. They track a simple, implicit form of provenance and use it to express the complex update policies needed to correctly handle ordered data. The second case study describes ORCHESTRA [17, 19], a collaborative data sharing system [22]. In ORCHESTRA, data is distributed across tables located on many different peers, and the relationship between connected peers is specified using GLAV [16] schema mappings. Every node coalesces data from remote peers and uses its own copy of the data to answer queries over the distributed dataset. Provenance is used to perform incremental maintenance of each peer as updates are applied to remote peers, and to filter “incoming” updates according to trust conditions.
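
To make the role of provenance concrete, the following minimal Python sketch shows a view that is not one-to-one (a projection that merges several source tuples into one view tuple) and records, for each view tuple, the set of source tuples that derive it. That record is then used for the two tasks mentioned above: propagating a source deletion incrementally, and filtering view tuples by a trust condition. This is an illustration under simplifying assumptions, not the mechanism used by Harmony or ORCHESTRA; the tuple layout, the peer names, and the functions build_view, delete_source, and trusted_only are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source tuples: (peer, event_id, title, day).
# The data and the schema are invented for illustration only.
SOURCE = {
    ("peerA", 1, "standup", "Mon"),
    ("peerB", 2, "standup", "Mon"),
    ("peerB", 3, "review", "Tue"),
}


def build_view(source):
    """Project each source tuple to (title, day).

    The projection is not one-to-one, so each view tuple also keeps the
    set of source tuples that derive it (its provenance).
    """
    view = defaultdict(set)
    for row in source:
        _peer, _event_id, title, day = row
        view[(title, day)].add(row)
    return dict(view)


def delete_source(view, row):
    """Propagate the deletion of one source tuple without recomputing the view.

    The tuple is removed from every provenance set; a view tuple is dropped
    only when no derivation of it remains.
    """
    updated = {}
    for out, derivations in view.items():
        remaining = derivations - {row}
        if remaining:
            updated[out] = remaining
    return updated


def trusted_only(view, trusted_peers):
    """Keep a view tuple only if some derivation comes from a trusted peer."""
    return {
        out: derivations
        for out, derivations in view.items()
        if any(row[0] in trusted_peers for row in derivations)
    }


view = build_view(SOURCE)
after_delete = delete_source(view, ("peerA", 1, "standup", "Mon"))
trusted = trusted_only(view, {"peerA"})
```

In this sketch, deleting the peerA tuple leaves the ("standup", "Mon") view tuple in place because the peerB tuple still derives it. Without the provenance sets, the view alone could not tell whether that tuple should survive the deletion.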