Cooperative caching and prefetching in parallel/distributed file systems

If we examine the structure of the applications that run on parallel machines, we observe that their I/O needs increase tremendously every day. These applications work with very large data sets which, in most cases, do not fit in memory and have to be kept on disk. The input and output data files are also very large and have to be accessed very fast. These large applications also want to be able to checkpoint themselves without wasting too much time. These facts constantly increase the expectations placed on parallel and distributed file systems. Thus, these file systems have to improve their performance to avoid becoming the bottleneck in parallel/distributed environments.

On the other hand, while the performance of new processors, interconnection networks and memory increases very rapidly, no such thing happens with disk performance. This lack of improvement is due to the mechanical parts used to build disks. These components are slow and limit both the latency and the bandwidth of the disk. Thus, the performance of the file-system operations that have to access the disk is also limited. As mechanical technology cannot keep pace with electronic technology, the research community is in search of solutions that improve file-system performance without expecting a significant improvement in the mechanical devices.

In this thesis, we propose a solution to this problem by decoupling the performance of the file system from the performance of the disk. We achieve this by designing a new cooperative cache and some aggressive-prefetching algorithms. Both mechanisms decrease the number of times the file system has to access the slow disk in the critical path of a user request. Furthermore, the resources used in these solutions are large memories and high-speed interconnection networks, which grow at a pace similar to the rest of the components in a parallel machine.

We propose a new approach to designing cooperative caches.
Cooperative caches have traditionally based their performance on achieving high physical locality; that is, the system tried to cache file blocks in the nodes where they were going to be used. Our thesis is that high-performance cooperative caches can be implemented without exploiting this locality. We will show that focusing the design on the global effectiveness of the cache, the speed of remote operations, and the simplification of the coherence mechanism results in a better and simpler cooperative cache. We also propose a mechanism that converts traditional prefetching algorithms into aggressive ones that take advantage of the large size of the cooperative cache.

In this work, we have designed two cooperative caches using the proposed new approach. The first one has a centralized control, and its main objective was to increase our knowledge about cooperative caches. With this design, we also wanted to show that a centralized control is not a bad idea for small networks with tens of nodes. The second design is distributed and also includes a fault-tolerance mechanism. With this second design, we show that the approach proposed in this thesis really achieves high-performance cooperative caches.

All the results presented in this thesis have been obtained through simulations so that a wide range of architectures can be studied. Furthermore, we have also used a new simulation methodology that allows more accurate simulations than the traditional one. To test this work, we have compared our proposals with the state of the art in cooperative caches and parallel/distributed file systems. This comparison has been done under two different environments. The first one characterizes a parallel machine, and the workload used is the one described in the CHARISMA project. The second one emulates a network of workstations using the Sprite trace files.
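The central idea behind a cooperative cache is that a block missing from a node's local memory can be served from another node's memory over the fast interconnect before paying the cost of a disk access. The following toy model sketches that read path; it is only an illustration of the general technique, not any of the designs evaluated in this thesis, and the class, method names, and dict-based stores are all illustrative assumptions:

```python
class CooperativeCache:
    """Toy cooperative cache: each node holds a private in-memory
    cache; a read checks the local cache, then the caches of all
    peer nodes, and only then falls back to the slow disk."""

    def __init__(self, node_caches, disk):
        self.node_caches = node_caches  # one dict per node: block_id -> data
        self.disk = disk                # dict: block_id -> data

    def read(self, node_id, block_id):
        """Return (data, source), where source records which level
        of the hierarchy served the request."""
        local = self.node_caches[node_id]
        # 1. Local hit: fastest path, served from this node's memory.
        if block_id in local:
            return local[block_id], "local"
        # 2. Remote hit: the block lives in a peer's memory and is
        #    fetched over the interconnect, still avoiding the disk.
        for peer_id, peer in enumerate(self.node_caches):
            if peer_id != node_id and block_id in peer:
                return peer[block_id], "remote"
        # 3. Global miss: pay the disk cost and cache the block locally.
        data = self.disk[block_id]
        local[block_id] = data
        return data, "disk"
```

For example, with two nodes where node 1 already caches block `b1`, a read of `b1` from node 0 is a remote hit, while a first read of an uncached block `b2` goes to disk and only subsequent reads hit locally. The point of the thesis's approach is that steps 1 and 2 together make the effective cache as large as the sum of all node memories, so the disk path is rarely taken.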
ACKNOWLEDGMENTS

I would like to thank Jesús Labarta, my thesis advisor, for his guidance of this dissertation and his support. He has taught me all I know about research, and his patience with me has been infinite (unfortunately, not like his free time).

I am especially grateful to Sergi Girona for many insightful discussions in the first stage of this work. I also want to thank him for his patience reading my manuscripts and making interesting comments, which I did not always follow. Finally, I am indebted to him for all the modifications and new features implemented in DIMEMAS, without which this work could never have been done.

My gratitude also goes to Maite Ortega for her encouragement throughout all the time I have been working on this thesis. I also want to thank her for implementing the first prototype of the cooperative cache described in this work.

Among all the teachers I've had during my academic life, I am indebted to Nacho Navarro, Jordi Torres and Teo Jové because they were the first ones to show me what an operating system was. Since then, operating systems have become one of my greatest interests. Furthermore, I also have to thank Nacho Navarro for guiding my first steps in the task of research.

I have a tendency to get very excited with a successful result and to get very upset when the results are not so successful. Roger Espasa has been the person who has always tried to balance my mood so that bad moments are not so bad. I thank him for this, although I am not sure about his success. I also thank him for being an excellent office-mate during these five years and for helping me every time I needed advice.

I also want to thank the system administrators from both LCAC and CEPBA, and especially Oriol Riu, Judit Giménez and Rosa Castro, for their excellent technical support. I also want to thank all of them for those wonderful and relaxing coffee breaks and for our weekly lunches.
There have been two very special friends, Manuel Torralba and Geni Panadés, who have encouraged me during all the time I have been working on this thesis. I thank Manuel for all the exciting and very useful discussions we have had in the last years. I also thank him for all the wonderful moments we have shared together. To Geni, I have to thank her for being such a good friend and for always being ready to listen to me (by the way, did I ever tell you what this thesis is about?).

I would like to thank my parents, Jaume and Maria Lluïsa, and my brother Kikus for their love and support along these years. Their dedication and encouragement have made this work possible. I also thank them for believing, more than I did at some times, that I was going to finish this work.

There is a person who has suffered the worst part of working on a thesis: bad moods, no free time, etc. Glòria has borne all of this without complaining (or at least, not too often). Furthermore, she has given me all the support I needed in all those bad moments when you really need somebody. For all of this, and for many other things I have no space to write down here, thanks a lot.

Finally, I'd like to thank Xavi Martorell and Yolanda Becerra for reading this document and making many interesting comments.

This work was supported in part by the Ministry of Education of Spain under contracts TIC 537/94 and TIC 0429/95 and by CEPBA (the European Center for Parallelism of Barcelona).