Scalable Distributed Virtual Data Structures

Big data stored in scalable, distributed data structures is now popular. We extend the idea to big, virtual data. Big, virtual data is not stored but materialized one record at a time at the nodes of a scalable, distributed, virtual data structure spanning thousands of nodes. The necessary cloud infrastructure is now available for general use. The records are used by some big computation that scans every record and retains (or aggregates) only a few, based on criteria provided by the client. The client sets a limit R on the time the scan takes at each node, for example 10 minutes. We define here two scalable distributed virtual data structures called VH* and VR*. They use, respectively, hash and range partitioning. While scan speed can differ between nodes, both structures select the smallest number of nodes necessary to perform the scan in the allotted time R. We show the usefulness of our structures by applying them to the problem of recovering an encryption key and to the classic knapsack problem.
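
To make the idea of scanning virtual records within a per-node time limit concrete, here is a minimal Python sketch of the range-partitioned case. It is an illustration under stated assumptions, not the VR* algorithm itself: the names plan_ranges, scan, rates, materialize and keep are hypothetical, each node's scan rate is assumed to be known in advance, and nodes are assumed to be acquired in a fixed order.

```python
from typing import Callable, Iterator, List, Tuple


def plan_ranges(total: int, rates: List[float], limit: float) -> List[Tuple[int, int, int]]:
    """Give each node, in order, a contiguous half-open key range [lo, hi).

    total : number of virtual records, identified by keys 0 .. total-1
    rates : assumed scan rate of each node, in records per second
    limit : per-node time limit R, in seconds

    Nodes are used only until the whole key space is covered, so the
    result uses the shortest prefix of the node list that can finish in R.
    """
    plan = []
    lo = 0
    for node, rate in enumerate(rates):
        if lo >= total:
            break  # key space already covered, no further node needed
        capacity = int(rate * limit)  # records this node can scan within R
        hi = min(total, lo + capacity)
        if hi > lo:
            plan.append((node, lo, hi))
        lo = hi
    if lo < total:
        raise ValueError("node capacity is insufficient for the time limit")
    return plan


def scan(lo: int, hi: int,
         materialize: Callable[[int], int],
         keep: Callable[[int], bool]) -> Iterator[int]:
    """Materialize records lo .. hi-1 one at a time; yield only those kept."""
    for key in range(lo, hi):
        record = materialize(key)  # the record exists only during this step
        if keep(record):
            yield record


if __name__ == "__main__":
    # Four candidate nodes with different speeds, R = 200 seconds.
    plan = plan_ranges(total=1000, rates=[2.0, 1.0, 3.0, 5.0], limit=200)
    for node, lo, hi in plan:
        hits = list(scan(lo, hi, lambda k: k * k, lambda r: r % 997 == 0))
        print(node, (lo, hi), hits)
```

In this sketch a faster node simply receives a proportionally larger range, so the fourth node is never used. The hash-partitioned case would differ only in how keys are assigned to nodes.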