DOCOMO Euro-Labs, Munich, Germany

Motivation. Peer-to-peer (P2P) systems are mostly deployed in heterogeneous environments with resource availability varying not only across the nodes but also over time. If any of the shared computational, storage or network resources are exhausted, failures and delays occur. The commonly used crash-stop failure model assumes that once a node stops sending messages it never resumes. Such failures are trivially detected and appropriate algorithms are run that maintain the connectivity and routing efficiency of the P2P overlay under continuous arrivals and departures of the peers (i.e. churn) [6], [4].

The failure detection mechanisms in the crash-stop model are typically tuned to minimize the number of false positives that might be caused by intermittent message dropping or delays. Avoiding these false positives is important as oversensitive failure detection triggers more overlay maintenance events. Overlay maintenance is costly, not only because the peers need to run the necessary protocols for acquiring new neighbors, but also because the applications (e.g. DHTs) using the overlay need to respond to the failures as well. For these reasons the handling of the non-permanent failures cannot simply be delegated to the overlay maintenance. These failures may significantly affect overlay routing and additional fault-tolerance mechanisms are necessary.

The failure model. The causes of message loss and delays can be numerous. In heterogeneous P2P systems running ever more intensive workloads nodes may become overloaded [5]. Networks may experience transient connectivity problems [3]. External adversaries can mount DDoS attacks [2], while the internal adversaries can take control over a fraction of the peers in the system and disrupt the message passing protocols [1].

In this paper we abstract away from the causes of failures and subsume them in a well-defined failure model. A fraction of peers are allowed to arbitrarily delay or drop messages.
The drops and delays can occur in a message-dependent way. However, we forbid message mutation and spurious message injection.

The protocol. Within the above failure model we address the problem of reliable recursive message routing in structured overlays. In our Forward Feedback Protocol (FFP) each routed message is followed on its routing path by a feedback message. Feedback signals either success or failure of message delivery. Peers accumulate feedback and based on it adjust their routing decisions. Routing path delays exceeding a timeout and dropped messages trigger negative feedback, which leads to readjustment of the paths to route around the peers causing delays or loss.

Each peer locally keeps a set of success estimators for each of its neighbors. The success estimators are random variables reflecting the history of the past routing outcomes. When a message arrives and needs to be forwarded the peer draws samples from the success estimators. Based on these samples the peer probabilistically picks the next hop that maximizes routing success. When the feedback message subsequently arrives it is used to update the success estimators. Over time as the peer is forwarding service requests and receiving feedback it improves its routing decisions.

The proposed FFP protocol has the following properties:
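To illustrate the estimator-driven forwarding described in the protocol paragraph above, the following is a minimal sketch under stated assumptions, not the FFP implementation. The text does not say which distribution the success estimators follow; the sketch assumes a Beta distribution over each neighbor's delivery rate and picks the neighbor with the largest drawn sample. It also keeps a single estimator per neighbor rather than a set. All names (SuccessEstimator, Peer, pick_next_hop, on_feedback, simulate) and the delivery rates in the simulation are hypothetical.

```python
import random


class SuccessEstimator:
    """Assumed form: Beta(successes + 1, failures + 1) belief about one neighbor."""

    def __init__(self):
        self.successes = 0
        self.failures = 0

    def sample(self, rng):
        # Draw one plausible delivery rate given the feedback seen so far.
        return rng.betavariate(self.successes + 1, self.failures + 1)

    def update(self, delivered):
        # Positive feedback counts as a success; negative feedback
        # (timeout or drop) counts as a failure.
        if delivered:
            self.successes += 1
        else:
            self.failures += 1

    def mean(self):
        return (self.successes + 1) / (self.successes + self.failures + 2)


class Peer:
    """Hypothetical peer keeping one estimator per neighbor."""

    def __init__(self, neighbors, seed=None):
        self.rng = random.Random(seed)
        self.estimators = {n: SuccessEstimator() for n in neighbors}

    def pick_next_hop(self, candidates):
        # Sample every candidate's estimator and forward to the largest sample,
        # so rarely used neighbors are still tried occasionally.
        samples = {n: self.estimators[n].sample(self.rng) for n in candidates}
        return max(samples, key=samples.get)

    def on_feedback(self, neighbor, delivered):
        self.estimators[neighbor].update(delivered)


def simulate(rounds=2000, seed=1):
    """Route repeatedly over two neighbors with made-up delivery rates."""
    rng = random.Random(seed)
    true_rate = {"a": 0.95, "b": 0.20}  # illustrative values only
    peer = Peer(true_rate, seed=seed)
    picks = {n: 0 for n in true_rate}
    for _ in range(rounds):
        hop = peer.pick_next_hop(list(true_rate))
        picks[hop] += 1
        peer.on_feedback(hop, rng.random() < true_rate[hop])
    return peer, picks
```

Calling simulate() returns the peer and a per-neighbor count of forwarding decisions. Under these assumptions the unreliable neighbor is chosen less and less often as negative feedback accumulates, which mirrors the path readjustment described above.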
[1] Daniel Stutzbach et al. Understanding churn in peer-to-peer networks. IMC '06, 2006.
[2] David Mazières et al. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. IPTPS, 2002.
[3] Ion Stoica et al. Non-Transitive Connectivity and DHTs. WORLDS, 2005.
[4] Aleksandar Kuzmanovic et al. Denial-of-service resilience in peer-to-peer file sharing systems. SIGMETRICS '05, 2005.
[5] Scott Shenker et al. Fixing the Embarrassing Slowness of OpenDHT on PlanetLab. WORLDS, 2005.
[6] David R. Karger et al. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM '01, 2001.