Cloud computing

We are used to having huge datasets pouring out of high-throughput genome centres, but with the advent of ultra-high-throughput sequencing, genotyping and other functional genomics in every laboratory, we are facing a scary new era of petabyte-scale data. For example, the 1000 Genomes Project will probably produce about 1 TB of finished data, yet processing that data required about 100 TB of scratch disk. Working at this level, real technical limitations start to hamper progress. Storage is not just a matter of having enough capacity: it must also be available to your compute over the network, with sufficient I/O to do anything in real time. Software language and implementation become critical factors when dealing with terabytes of data. With such high-intensity computing, getting enough power and cooling becomes a real issue. How do you let anyone else access the data? Is the data backed up and, even if it is, how many years would it take to restore from tape?

So how will we solve all these technical hurdles? Each of them can be solved with technical knowledge, but you do not want to have to worry about working within these constraints. With large datasets, they continually get in the way of real research, and although each problem can be solved individually, their combined impact on the scientific workflow can be considerable. It would be wiser to optimize for productivity.

In software development, similar constraints are addressed with abstraction layers. Database access is mediated through object-relational mapping tools, and visualization is aided by powerful graphical packages, which saves individual research groups from having to reinvent the wheel; examples include Rails, Eclipse, Processing, Hibernate and Catalyst. Cloud computing offers a similar level of abstraction for many of the constraints encountered when dealing with extremely large datasets.
You might have encountered similar ideas when using hosted services such as Google Mail, ManyEyes (http://manyeyes.alphaworks.ibm.com) and others. These tools provide
