Computational challenges in the analysis of large, sparse, spatiotemporal data

The pervasive sources of data in today's networked computing environment provide many innovative opportunities, from mining patterns of individual behavior, to enabling data-intensive approaches for scientific discovery, to supporting new kinds of personal interactions and experiences. Passively collected metadata can also be mined for a variety of social analyses. However, because of their vast size and diversity, these data resources can pose serious computational challenges to researchers and analysts. This paper highlights several of the key challenges involved in efficiently collecting, storing, and analyzing datasets consisting of millions of sparse files with spatial, temporal, and network features. We focus on the computational issues faced in analyzing Call Detail Records (CDRs), the metadata (i.e., log files) passively collected by mobile phone operators about transactions on their telecommunications networks. CDRs and related data provide a rich foundation for research in fields ranging from anthropology and sociology to electrical engineering and urban planning. After describing the data and the challenges they present, we present our current framework for computational analysis and discuss opportunities for future work.