Efficient storage of data in cloud computing

Acknowledgements

This thesis would not have been written without the help of several people, to whom I would like to express my appreciation. First of all, to my advisor, Prof. José Orlando Pereira, for his guidance, which was essential to completing my master's thesis. Also, to Prof. Rui Oliveira for our valuable conversations, which proved extremely important in defining the path of my work. To Francisco Maia and Francisco Cruz for all the work and non-work discussions that became extremely important for my thesis. To Nuno Carvalho for all his support and ideas. To Ricardo Vilaça, Miguel Matos, Bruno Costa and Ana Nunes for their help. To Paula for helping me revise my thesis document and for all the support, even in the most stressful times. Finally, to my parents and my brother for supporting me throughout all my work and my life.

Abstract

Keeping critical data safe and accessible from several locations has become a global concern, whether the data is personal, organizational or generated by applications. As a consequence, on-line storage services have emerged. In addition, the new paradigm of Cloud Computing brings new ideas for building services that allow users to store their data and run their applications in the Cloud. Smart and efficient management of these services' storage makes it possible to improve the quality of service offered and to optimize the usage of the infrastructure on which the services run. This management is even more critical and complex when the infrastructure is composed of thousands of nodes running several virtual machines and sharing the same storage. Eliminating redundant data in these services' storage can simplify and enhance this management.

This dissertation presents a solution to detect and eliminate duplicated data between virtual machines that run on the same physical host and write their virtual disks' data to a shared storage. A prototype that implements this solution is introduced and evaluated. Finally, a study that compares the efficiency of two different approaches used to eliminate redundant data in a personal data set is described.
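The abstract does not say how duplicates between virtual disks are detected. As a rough illustration only, the sketch below assumes one common approach: fixed-size blocks indexed by a content hash, with reference counting so that a shared block is freed only when no virtual disk points to it any more. The class `DedupStore`, the 4 KiB block size and the choice of SHA-256 are all assumptions made for this example and are not taken from the thesis or its prototype.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, not specified in the abstract


class DedupStore:
    """Hypothetical shared storage that keeps one copy of each distinct block."""

    def __init__(self):
        self.blocks = {}    # digest -> block contents (stored once)
        self.refcount = {}  # digest -> number of virtual disk blocks using it
        self.disks = {}     # (vm_id, block_no) -> digest

    def write(self, vm_id, block_no, data):
        """Write one block of a VM's virtual disk, sharing identical content."""
        digest = hashlib.sha256(data).hexdigest()
        old = self.disks.get((vm_id, block_no))
        if old == digest:
            return  # rewriting the same content changes nothing
        if digest not in self.blocks:
            self.blocks[digest] = data  # first copy of this content
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        self.disks[(vm_id, block_no)] = digest
        if old is not None:
            self._release(old)  # the overwritten content loses one user

    def _release(self, digest):
        """Drop one reference and free the block when nobody uses it."""
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.blocks[digest]

    def read(self, vm_id, block_no):
        """Return the contents of one block of a VM's virtual disk."""
        return self.blocks[self.disks[(vm_id, block_no)]]

    def stored_blocks(self):
        """Number of distinct blocks physically kept in the store."""
        return len(self.blocks)


if __name__ == "__main__":
    store = DedupStore()
    common = b"a" * BLOCK_SIZE
    store.write("vm1", 0, common)
    store.write("vm2", 0, common)  # same content, stored only once
    print(store.stored_blocks())
```

In this sketch two virtual machines writing the same block consume the space of a single block. A real system would also have to handle hash collisions, persistence and concurrent writers, which are left out here.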
