Paper information - Elections in a Distributed Computing System

Elections in a Distributed Computing System

After a failure occurs in a distributed computing system, it is often necessary to reorganize the active nodes so that they can continue to perform a useful task. The first step in such a reorganization or reconfiguration is to elect a coordinator node to manage the operation. This paper discusses such elections and reorganizations. Two types of reasonable failure environments are studied. For each environment, assertions which define the meaning of an election are presented. An election algorithm which satisfies the assertions is presented for each environment.

Index Terms: Crash recovery, distributed computing systems, elections, failures, mutual exclusion, reorganization.

I. INTRODUCTION

A distributed system is a collection of autonomous computing nodes which can communicate with each other and which cooperate on a common goal or task (4). For example, the goal may be to provide the user with a database management system, and in this case the distributed system is called a distributed database (16).

When a node fails or when the communication subsystem which allows nodes to communicate fails, it is usually necessary for the nodes to adapt to the new conditions so that they may continue working on their joint goal. For example, consider a collection of nodes which are processing sensory data and trying to locate a moving object (18). Each node has some sensors which provide it with a local view of the world. The nodes exchange data and together decide where the object is located. If one of the nodes ceases to operate, the remaining nodes should recognize this and modify their strategy for locating the object. A node which neighbors the failed node could try to collect sensory data for the area which was assigned to the failed node. Another alternative would be for the remaining nodes to use a detection algorithm which is not very sensitive to "holes" in the area being studied. Or the nodes could decide to switch to such an algorithm when the failure occurs.
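The first alternative in the sensor example, where a neighbor of the failed node takes over its area, can be illustrated with a small sketch. This is only an illustration under stated assumptions and is not taken from the paper: the function name `reassign_areas`, the dictionary-based data model, and the rule "pick the live neighbor that currently covers the fewest areas" are all hypothetical. Areas that no live neighbor can cover are returned separately; they correspond to the "holes" mentioned above.

```python
def reassign_areas(assignment, neighbors, failed):
    """Hypothetical sketch: hand each failed node's areas to a live neighbor.

    assignment: dict mapping node name -> list of areas it observes
    neighbors:  dict mapping node name -> list of neighboring node names
    failed:     set of node names that have ceased to operate

    Returns (new_assignment, uncovered), where uncovered lists the areas
    that no live neighbor could take over (the "holes").
    """
    live = set(assignment) - set(failed)
    # Copy the lists so the caller's assignment is not modified.
    new_assignment = {node: list(assignment[node]) for node in sorted(live)}
    uncovered = []

    for node in sorted(failed):
        candidates = [n for n in neighbors.get(node, []) if n in live]
        if candidates:
            # Assumed policy: the least-loaded live neighbor takes over;
            # ties are broken by node name so the result is deterministic.
            helper = min(candidates, key=lambda n: (len(new_assignment[n]), n))
            new_assignment[helper].extend(assignment[node])
        else:
            uncovered.extend(assignment[node])

    return new_assignment, uncovered


if __name__ == "__main__":
    assignment = {"A": ["north"], "B": ["center"], "C": ["south"]}
    neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    print(reassign_areas(assignment, neighbors, failed={"B"}))
```

In this sketch every node is assumed to know which nodes have failed and to compute the same reassignment. Reaching that kind of agreement after a failure is the sort of coordination problem that motivates electing a coordinator.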
If enough nodes fail, the remaining nodes may decide that they just cannot perform the assigned task, and may select a new or better suited job for themselves.

There are at least two basic strategies by which a distributed system can adapt to failures. One strategy is to have software which can operate continuously and correctly as failures occur and are repaired (9). (In the previous example, this would correspond to using an algorithm which can detect the object even when there are holes in the data.) The second alternative is to temporarily halt normal operation and to take some time

Manuscript received January 7, 1981; revised July 17, 1981. This work was supported in part by the National Science Foundation under Grant ECS-

Hector Garcia-Molina

[1] Elwyn R. Berlekamp, et al. Algebraic coding theory, 1984, McGraw-Hill series in systems science.

[2] Robert Metcalfe, et al. Ethernet: distributed packet switching for local computer networks, 1976, CACM.