Shared State for Client-Server Mining

For many organizations, the explosive growth in data collection techniques and database technology has resulted in large and dynamically growing datasets. These organizations are increasingly turning to data mining, the process of extracting useful information from such datasets. The datasets typically reside in a remote repository accessible over a local network or an internetwork. Despite advances in processing speed and networking technology, remote data mining is difficult because of the conflicting requirements imposed by the size of the data involved and the interactive nature of data mining. The size of the datasets prohibits transferring the entire data to the remote client(s). In addition, data mining is often an iterative process, with the user tweaking the supplied parameters according to domain-specific knowledge. This compounds the problem of increased response times due to network and server delays.

We have shown [32] that these applications can often be structured so that subsequent requests operate on relatively small summary data structures. Once the summary structure is computed and communicated to the client, interactions can take place on the client without further communication with the server. The summary is based on a snapshot of the actual data at a given point in time. If the data is being modified dynamically, the summary is likely to change, and the client's copy of the summary structure must then be kept up to date. Traditional realizations of this communication employ some form of message passing or remote procedure call (RPC) to keep data coherent; they are rather cumbersome and can be inefficient. Ease of programming suggests the need for an abstraction of shared state that is similar in spirit to distributed shared memory (DSM) semantics.
However, even the most relaxed DSM coherence model (release consistency [17]) can result in a prohibitively large amount of communication for the type of environment in which data mining may typically be performed. These

∗ This work is supported in part by NSF grants EIA-9972881, CCR-9702466, CCR-9705594, and CCR-9988361; and an external research grant from Compaq.
† CIS Department, Ohio-State University. Email: srini@cis.ohio-state.edu
‡ CS Department, University of Rochester. Email: sandhya@cs.rochester.edu
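
The idea of mining against a client-side copy of a summary structure, described in the text above, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' system: the class names `SummaryServer` and `MiningClient`, the use of per-item counts as the summary, and the version counter that stands in for whatever coherence mechanism actually keeps the client's copy up to date. The server is simulated in-process; in a real deployment each call on it would cross the network.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SummaryServer:
    """Hypothetical stand-in for the remote repository (simulated in-process)."""
    transactions: list = field(default_factory=list)
    version: int = 0

    def add(self, transaction):
        # Each update changes the data, so summaries of older snapshots go stale.
        self.transactions.append(frozenset(transaction))
        self.version += 1

    def summary(self):
        # Summary structure (an assumption): item counts for the current snapshot.
        counts = Counter(item for t in self.transactions for item in t)
        return self.version, len(self.transactions), counts


class MiningClient:
    """Hypothetical client that answers queries from a cached summary."""

    def __init__(self, server):
        self.server = server
        self.version = None
        self.total = 0
        self.counts = Counter()
        self.fetches = 0  # number of times the summary was transferred

    def refresh(self):
        # Transfer the summary only when the cached copy is stale.
        if self.version != self.server.version:
            self.version, self.total, self.counts = self.server.summary()
            self.fetches += 1

    def frequent_items(self, min_support):
        # Answered entirely from the cached summary; no server interaction.
        return sorted(item for item, count in self.counts.items()
                      if count / self.total >= min_support)


server = SummaryServer()
for t in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]):
    server.add(t)

client = MiningClient(server)
client.refresh()
high = client.frequent_items(0.75)  # user tries one support threshold...
low = client.frequent_items(0.5)    # ...then another, with no new transfer
```

In this sketch the user can vary `min_support` as often as desired while the summary is transferred only once; a further transfer happens only after the server's data has changed and `refresh` is called again.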

[1]  Ramesh Subramonian,et al.  A framework for distributed data mining , 1998 .

[2]  Geraldine Fitzpatrick,et al.  Work, Locales and Distributed Social Worlds , 1995, ECSCW.

[3]  Miguel Castro,et al.  Safe and efficient sharing of persistent objects in Thor , 1996, SIGMOD '96.

[4]  Robert D. Logcher,et al.  DICE: An object-oriented programming environment for cooperative engineering design , 1992 .

[5]  L. Devroye.  A Course in Density Estimation , 1987 .

[6]  Kirk L. Johnson,et al.  CRL: high-performance all-software distributed shared memory , 1995, SOSP.

[7]  Andrew P. Black,et al.  Fine-grained mobility in the Emerald system , 1987, TOCS.

[8]  Nicholas Carriero,et al.  Matching Language and Hardware for Parallel Computation in the Linda Machine , 1988, IEEE Trans. Computers.

[9]  James R. Larus,et al.  Fine-grain access control for distributed shared memory , 1994, ASPLOS VI.

[10]  David J. DeWitt,et al.  Shoring up persistent applications , 1994, SIGMOD '94.

[11]  Sanjay Ranka,et al.  An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases , 1997, KDD.

[12]  Srinivasan Parthasarathy,et al.  Clustering Distributed Homogeneous Datasets , 2000, PKDD.

[13]  Krithi Ramamritham,et al.  Maintaining temporal coherency of virtual data warehouses , 1998, Proceedings 19th IEEE Real-Time Systems Symposium.

[14]  Adam Dingle,et al.  Web Cache Coherence , 1996, Comput. Networks.

[15]  Alan L. Cox,et al.  TreadMarks: shared memory computing on networks of workstations , 1996 .

[16]  Liviu Iftode,et al.  Improving release-consistent shared virtual memory using automatic update , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[17]  Srinivasan Parthasarathy,et al.  Incremental and interactive sequence mining , 1999, CIKM '99.

[18]  Srinivasan Parthasarathy,et al.  Memory Placement Techniques for Parallel Association Mining , 1998, KDD.

[19]  Dirk Grunwald,et al.  Improving the cache locality of memory allocation , 1993, PLDI '93.

[20]  M. van Steen,et al.  The Architectural Design of Globe: A Wide-Area Distributed System , 1997 .

[21]  Heikki Mannila,et al.  Finding interesting rules from large sets of discovered association rules , 1994, CIKM '94.

[22]  Brian N. Bershad,et al.  Software write detection for a distributed shared memory , 1994, OSDI '94.

[23]  Srinivasan Parthasarathy,et al.  Active Mining in a Distributed Setting , 1999, Large-Scale Parallel Data Mining.

[24]  Marvin Theimer,et al.  Managing update conflicts in Bayou, a weakly connected replicated storage system , 1995, SOSP.

[25]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[26]  Heikki Mannila,et al.  Similarity of Attributes by External Probes , 1998, KDD.

[27]  Rafael Alonso,et al.  Data caching issues in an information retrieval system , 1990, TODS.

[28]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, KDD 1996.

[29]  M. Frans Kaashoek,et al.  Rover: a toolkit for mobile information access , 1995, SOSP.

[30]  John Riedl,et al.  Toward computer-supported concurrent software engineering , 1993, Computer.

[31]  Michael J. Franklin,et al.  Client Data Caching: A Foundation for High Performance Object Database Systems , 1996 .

[32]  Galen C. Hunt,et al.  Vm-based Shared Memory On Low-latency, Remote-memory-access Networks , 1996, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[33]  Marc Shapiro,et al.  PerDiS - a Persistent Distributed Store for Cooperative Applications , 1997 .

[34]  Jessica K. Hodgins,et al.  Temporal notions of synchronization and consistency in Beehive , 1997, SPAA '97.

[35]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[36]  Mitsunori Ogihara,et al.  Active data mining in a distributed setting , 2000 .

[37]  Mitsunori Ogihara,et al.  Clustering Homogeneous Distributed Datasets , 2000 .

[38]  Ouri Wolfson,et al.  Divergence caching in client-server architectures , 1994, Proceedings of 3rd International Conference on Parallel and Distributed Information Systems.

[39]  Philip S. Yu,et al.  Online generation of association rules , 1998, Proceedings 14th International Conference on Data Engineering.

[40]  Henri E. Bal,et al.  Orca: A Language For Parallel Programming of Distributed Systems , 1992, IEEE Trans. Software Eng.

[41]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, Proceedings. The 17th Annual International Symposium on Computer Architecture.

[42]  P.R. Wilson,et al.  Pointer swizzling at page fault time: efficiently and compatibly supporting huge address spaces on standard hardware , 1992, Proceedings of the Second International Workshop on Object Orientation in Operating Systems.