INTERACTING WITH DATA USING THE FILEHASH PACKAGE FOR R

The filehash package for R implements a simple key-value style database where character string keys are associated with data values that are stored on the disk. A simple interface is provided for inserting, retrieving, and deleting data from the database. Utilities are provided that allow filehash databases to be treated much like the environments and lists already used in R. These utilities are provided to encourage interactive and exploratory analysis on large datasets. Three different file formats for representing the database are currently available, and new formats can easily be incorporated by third parties for use in the filehash framework.

1 Overview and Motivation

Working with large datasets in R can be cumbersome because of the need to keep objects in physical memory. While many might generally see that as a feature of the system, the need to keep whole objects in memory creates challenges for those who want to work interactively with large datasets. Here we take a simple definition of “large dataset” to be any dataset that cannot be loaded into R as a single R object because of memory limitations. For example, a very large data frame might be too large for all of the columns and rows to be loaded at once. In such a situation, one might load only a subset of the rows or columns, if that is possible.

In a key-value database, an arbitrary data object (a “value”) has a “key” associated with it, usually a character string. When one requests the value associated with a particular key, it is the database’s job to match up the key with the correct value and return the value to the requester.

The most straightforward example of a key-value database in R is the global environment. Every object in R has a name and a value associated with it. When you execute at the R prompt

    > x <- 1
    > print(x)

the first line assigns the value 1 to the name/key “x”. The second line requests the value of “x” and prints 1 to the console.
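The insert, retrieve, and delete operations of a key-value database can be sketched with a plain R environment. The example below uses only base R functions; the names `db`, `x_value`, `has_x`, `keys`, and `has_x_after` are illustrative, and the code is not the filehash interface itself.

```r
# A fresh environment acts as a small in-memory key-value store.
db <- new.env()

# Insert: associate the key "x" with the value 1.
assign("x", 1, envir = db)

# Retrieve: look up the value stored under the key "x".
x_value <- get("x", envir = db)

# Check whether a key is present, and list all keys.
# inherits = FALSE restricts the lookup to db itself.
has_x <- exists("x", envir = db, inherits = FALSE)
keys <- ls(db)

# Delete: remove the key and its value from the store.
rm("x", envir = db)
has_x_after <- exists("x", envir = db, inherits = FALSE)
```

Here the values live in memory, which is exactly the limitation that an on-disk key-value store is meant to lift.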
R handles the task of finding the appropriate value for “x” by searching through a series of environments, including the namespaces of the packages on the search list. In most cases, R stores the values associated with keys in memory, so that the value of x in the example above was stored in and retrieved from physical memory. However, the idea of a key-value database can be generalized beyond this particular configuration. For example, as of R 2.0.0, much of the R code for R packages is stored in a lazy-loaded database, where the values are initially stored on disk and loaded into memory on first access [Rip04]. Hence, when R starts up, it uses relatively little memory, while the memory usage increases as more objects are requested. Data could also be stored on other computers (e.g. websites) and retrieved over the network.

The general S language concept of a database is described in Chapter 5 of the Green Book [Cha98] and earlier in [Cha91]. Although the S and R languages have different semantics with respect to how variable names are looked up and bound to values, the general concept of using a key-value database applies to both languages. Duncan Temple Lang has implemented this general database framework for R in the RObjectTables package of Omegahat [Tem02]. The RObjectTables package provides an interface for connecting R with arbitrary backend systems, allowing data values to be stored in potentially any format or location.
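The idea of a value that stays on disk until its key is first requested can be imitated in base R with `delayedAssign()`. This is only an analogy, under the assumption that a temporary `.rds` file stands in for an on-disk database entry; it is not how R's lazy-load database or RObjectTables is implemented, and the names `path`, `store`, and `state` are illustrative.

```r
# Write a value to disk; the temporary file plays the role of an
# on-disk database entry.
path <- tempfile(fileext = ".rds")
saveRDS(1:10, path)

store <- new.env()          # environment that will hold the key "big"
state <- new.env()          # used only to record when the read happens
state$loaded <- FALSE

# Bind the key "big" to a promise. The file is not read at this point.
delayedAssign("big", {
  state$loaded <- TRUE
  readRDS(path)
}, eval.env = environment(), assign.env = store)

loaded_before <- state$loaded   # nothing has been read from disk yet
total <- sum(store$big)         # first access forces the promise and reads the file
loaded_after <- state$loaded    # the value is now held in memory
```

After the first access the value is cached in `store`, so memory use grows only as keys are actually requested.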