Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase

Huge volumes of data are being accumulated from a variety of sources in engineering and scientific disciplines; this has been referred to as the “Data Avalanche”. Cloud computing infrastructures (such as Amazon Elastic Compute Cloud (EC2)) are designed to combine high compute performance with high-performance networking to meet the needs of data-intensive science. Reliable, scalable, distributed computing is used extensively on the cloud. Apache Hadoop is one such open-source project: it provides a distributed file system that creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable and rapid computation. Column-oriented databases built on Hadoop (such as HBase), along with the MapReduce programming paradigm, allow large-scale distributed computing applications to be developed with ease. In this chapter, benchmarking results on a small in-house Hadoop cluster composed of 29 nodes, each with 8-core processors, are presented along with a case study on distributed storage of electroencephalogram (EEG) data. Our results indicate that the Hadoop / HBase projects are still in their nascent stages but provide promising performance characteristics with regard to latency and throughput. In future work, we will explore the development of novel machine learning algorithms on this infrastructure.
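To make the idea of storing multidimensional EEG data in a sorted, column-oriented store such as HBase more concrete, the following is a minimal sketch of one possible row-key layout. It is not the schema used in the chapter, which the abstract does not describe. The function names (`make_row_key`, `parse_row_key`), the key layout (subject, channel, timestamp), and the `|` delimiter are all hypothetical and chosen only for illustration. The sketch relies on one known property of HBase: rows are kept sorted by the raw bytes of their keys, so a big-endian timestamp suffix keeps samples from one subject and channel contiguous and in time order.

```python
import struct

SEP = b"|"  # hypothetical delimiter; assumes ids never contain "|"


def make_row_key(subject_id: str, channel: str, timestamp_ms: int) -> bytes:
    """Build a byte row key: subject | channel | 8-byte big-endian timestamp.

    A big-endian unsigned 64-bit integer sorts byte-wise in the same order
    as it sorts numerically, which is what a byte-ordered store needs.
    """
    return (
        subject_id.encode("utf-8")
        + SEP
        + channel.encode("utf-8")
        + SEP
        + struct.pack(">Q", timestamp_ms)
    )


def parse_row_key(key: bytes):
    """Invert make_row_key.

    The timestamp is sliced by position (last 8 bytes) rather than split on
    the delimiter, because the packed integer may itself contain the
    delimiter byte.
    """
    prefix, packed_ts = key[:-9], key[-8:]
    subject_id, channel = prefix.split(SEP)
    (timestamp_ms,) = struct.unpack(">Q", packed_ts)
    return subject_id.decode("utf-8"), channel.decode("utf-8"), timestamp_ms


if __name__ == "__main__":
    # Keys created out of order sort back into time order as raw bytes.
    keys = [make_row_key("s01", "Fz", t) for t in (2000, 5, 300)]
    for key in sorted(keys):
        print(parse_row_key(key))
```

With a layout of this kind, a time-range query for one subject and channel becomes a single contiguous scan between two keys. A known trade-off is that monotonically increasing key suffixes can concentrate writes on one region, so a production schema might add salting or a different prefix order.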
