http://www.scs.cs.nyu.edu/coral/
The DSHT abstraction is specifically suited to locating replicated resources. DSHTs sacrifice the consistency of DHTs to support both frequent fetches and frequent stores of the same hash table key. The fundamental observation is that a node doesn’t need to know every replicated location of a resource—it only needs a single, valid, nearby copy. Thus, a sloppy insert is akin to an append in which a replica pointer appended to a “full” node spills over to the previous node in the lookup path. A sloppy retrieve only returns some randomized subset of the pointers stored under a given key.

In order to restrict queries to nearby machines, each Coral node is a member of several DSHTs, which we call clusters, of increasing network diameter. The diameter of a cluster is the maximum desired round-trip time between any two nodes it contains. When data is cached somewhere in a Coral cluster, any member of the cluster can locate a copy without querying machines farther away than the cluster’s diameter. Since nodes have the same identifiers in all clusters, even when data is not available in a low-diameter cluster, the routing information returned by the lookup can be used to continue the query in a larger-diameter cluster.

Note that some DHTs replicate data along the last few hops of the lookup path, which increases the availability of popular data and improves performance in the face of many readers. Unfortunately, even with locality-optimized routing, the last few hops of a lookup are precisely the ones that can least be optimized. Thus, without a clustering mechanism, even replication does not avoid the need to query distant nodes. Perhaps more importantly, when storing pointers in a DHT, nothing guarantees that a node storing a pointer is near the node pointed to. In contrast, this property follows naturally from the use of clusters.

Coral’s challenge is to organize and manage these clusters in a decentralized manner. As described in the next section, the DSHT interface itself is well-suited for locating and evaluating nearby clusters.

2 Design

This section first discusses Coral’s DSHT storage layer and its lookup protocols. Second, it describes Coral’s technique for forming and managing clusters.

2.1 A sloppy storage layer

A traditional DHT exposes two functions. put(key, value) stores a value at the specified m-bit key, and get(key) returns this stored value, just as in a normal hash table. Only one value can be stored under a key at any given time. DHTs assume that these keys are uniformly distributed in order to balance load among participating nodes. Additionally, DHTs typically replicate popular key/value pairs after multiple get requests for the same key.

In order to determine where to insert or retrieve a key, an underlying lookup protocol assigns each node an m-bit nodeid identifier and supplies an RPC find_closer_node(key). A node receiving such an RPC returns, when possible, contact information for another node whose nodeid is closer to the target key. Some systems [5] return a set of such nodes to improve performance; for simplicity, we hereafter refer only to the single node case. By iterating calls to find_closer_node, we can map a key to some closest node, which in most DHTs will require an expected O(log n) RPCs. This O(log n) number of RPCs is also reflected in nodes’ routing tables, and thus provides a rough estimate of total network size, which Coral exploits as described later.

DHTs are well-suited for keys with a single writer and multiple readers. Unfortunately, file-sharing and web-caching systems have multiple readers and writers. As discussed in the introduction, a plain hash table is the wrong abstraction for such applications.

A DSHT provides a similar interface to a DHT, except that a key may have multiple values: put(key, value) stores a value under key, and get(key) need only return some subset of the values stored. Each node stores only some maximum number of values for a particular key. When the number of values exceeds this maximum, they are spread across multiple nodes. Thus multiple stores on the same key will not overload any one node. In contrast, DHTs replicate the exact same data everywhere; many people storing the same key will all contact the same closest node, even while replicas are pushed back out into the network from this overloaded node.

More concretely, Coral manages values as follows. When a node stores data locally, it inserts a pointer to that data into the DSHT by executing put(key, nodeaddr). For example, the key in a distributed web cache would be hash(URL). The inserting node calls find_closer_node(key) until it locates the first node whose list stored under key is full, or it reaches the node closest to key. If this located node is full, we backtrack one hop on the lookup path. This target node appends nodeaddr with a timestamp to the (possibly new) list stored under key. We expect records to expire quickly enough to keep the fraction of stale pointers below 50%.
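To make the insert procedure concrete, here is a minimal sketch in Python. The Node class, its peer list, and the method names (find_closer_node, list_full, append_pointer) are assumptions made for this example rather than Coral's actual interface; only the walk-until-full-or-closest rule and the spill-over to the previous hop come from the description above.

    import time

    MAX_VALUES = 4      # assumed per-key capacity at each node; illustrative only

    class Node:
        """Toy in-memory stand-in for a DSHT participant (not Coral's API)."""

        def __init__(self, nodeid):
            self.nodeid = nodeid   # m-bit identifier, here just a Python int
            self.peers = []        # other Node objects this node knows about
            self.lists = {}        # key -> list of (nodeaddr, timestamp) entries

        def list_full(self, key):
            return len(self.lists.get(key, [])) >= MAX_VALUES

        def append_pointer(self, key, entry):
            self.lists.setdefault(key, []).append(entry)

        def find_closer_node(self, key):
            """Return a known peer strictly closer to key under the XOR metric,
            or None if this node is the closest one it knows about."""
            best = None
            for p in self.peers:
                if (p.nodeid ^ key) < (self.nodeid ^ key):
                    if best is None or (p.nodeid ^ key) < (best.nodeid ^ key):
                        best = p
            return best

    def sloppy_put(start, key, nodeaddr, max_hops=32):
        """Walk toward the node closest to key, stopping at the first node
        whose list under key is already full; if the stopping node is full,
        spill over to the previous node on the lookup path."""
        path = [start]
        node = start
        for _ in range(max_hops):
            if node.list_full(key):
                break
            nxt = node.find_closer_node(key)
            if nxt is None:        # reached the closest node we can find
                break
            path.append(nxt)
            node = nxt
        target = path[-2] if node.list_full(key) and len(path) >= 2 else node
        target.append_pointer(key, (nodeaddr, time.time()))

A sloppy retrieve would walk the same path but return only some randomized subset of the first non-empty list it encounters, rather than every pointer stored under the key.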
A get(key) operation traverses the identifier space ...

[Figure 1: the 160-bit id space, spanning 11...11 to 00...00.]

... “spilling-over” when full helps to balance load while ...
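The clusters introduced above also suggest how a retrieval proceeds across levels: query the smallest-diameter cluster first and, because a node keeps the same identifier in every cluster, continue the query in a larger-diameter cluster when no nearby copy is found. The sketch below reuses the toy Node class from the earlier example; the per-level entry points and the sloppy_get helper are illustrative assumptions, not Coral's actual API.

    import random

    def sloppy_get(node, key, want=4, max_hops=32):
        """Walk toward key within a single cluster and return a randomized
        subset of the first non-empty list encountered."""
        for _ in range(max_hops):
            entries = node.lists.get(key, [])
            if entries:
                return random.sample(entries, min(want, len(entries)))
            nxt = node.find_closer_node(key)
            if nxt is None:
                return []
            node = nxt
        return []

    def hierarchical_get(level_entries, key):
        """level_entries holds one entry node per cluster, ordered from the
        smallest to the largest diameter.  A node keeps the same identifier
        in every cluster, so a failed low-diameter lookup has already made
        routing progress before the query widens to the next level."""
        for entry in level_entries:
            pointers = sloppy_get(entry, key)
            if pointers:
                return pointers    # found a nearby copy; stop early
        return []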
The task, and Coral’s solution:

Task: Discovering and joining a low-level cluster, while only requiring knowledge of some other node, not necessarily a close one.
Solution: Coral nodes insert their own contact information and Internet topology hints into higher-level clusters. Nodes reply to unexpected requests with their cluster information. Sloppiness in the DSHT infrastructure prevents hotspots from forming when nodes search for new clusters and test random subsets of nodes for acceptable RTT thresholds. Hotspots would otherwise distort RTT measurements and reduce scalability.

Task: Merging close clusters into the same namespace without experiencing oscillatory behavior between the merging clusters.
Solution: Coral’s use of cluster size and age information ensures a clear, stable direction of flow between merging clusters. Merging may be initiated as the byproduct of a lookup to a node that has switched clusters.

Task: Splitting slow clusters into disjoint subsets, in a manner that results in an acceptable and stable partitioning without causing hotspots.
Solution: Coral’s definition of a cluster center provides a stable point about which to separate nodes. DSHT sloppiness prevents hotspots while a node determines its relative distance to this known point.
... nodes in the cluster, perhaps by simply looking up its own identifier as a natural part of joining.

As in any peer-to-peer system, a node must initially learn about some other Coral node to join the system. However, Coral adds an RTT requirement for a node’s lower-level clusters. A node unable to find an acceptable cluster creates a new one with a random cid. A node can join a better cluster whenever it learns of one.

Several mechanisms could have been used to discover clusters, including using IP multicast or merely waiting for nodes to learn about clusters as a side effect of normal lookups. However, Coral exploits the DSHT interface to let nodes find nearby clusters. Upon joining a low-level cluster, a node inserts itself into its higher-level clusters, keyed under the IP addresses of its gateway routers, discovered by traceroute. For each of the first five routers returned, it executes put(hash(router.ip), nodeaddr). A new node, searching for an acceptable low-level cluster, can perform a get on each of its own gateway routers to learn some set of topologically-close nodes.
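A short sketch of this discovery step follows. The dsht handle with put and get, and the traceroute callable, are assumptions made for illustration; only the keying scheme (hashing each of the first five gateway routers) and the use of a new node's own routers to find topologically-close candidates come from the paragraph above.

    import hashlib

    def router_key(router_ip):
        """Key the higher-level DSHTs under a 160-bit hash of a router address."""
        return int.from_bytes(hashlib.sha1(router_ip.encode()).digest(), "big")

    def advertise_topology(dsht, nodeaddr, traceroute):
        """After joining a low-level cluster, insert ourselves under each of
        the first five gateway routers on our outbound path."""
        for router_ip in traceroute()[:5]:
            dsht.put(router_key(router_ip), nodeaddr)

    def discover_nearby_nodes(dsht, traceroute):
        """A new node gets on its own gateway routers to learn a set of
        topologically-close candidates for an acceptable low-level cluster."""
        candidates = []
        for router_ip in traceroute()[:5]:
            candidates.extend(dsht.get(router_key(router_ip)))
        return candidates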
2.4 Merging clusters

While a small cluster diameter provides fast lookup, a large cluster capacity increases the hit rate in a lower-level DSHT. Therefore, Coral’s join mechanism for individual nodes automatically results in close clusters merging if nodes in both clusters would find either acceptable. This merge happens in a totally decentralized way, without any expensive agreement or leader-election protocol. When a node knows of two acceptable clusters at a given level, it will join the larger one. When a node switches clusters, it still remains in the routing tables of nodes in its old cluster. Old neighbors will still contact it; the node replies to level-i requests originating outside its current cluster with the tuple {cid_i, size_i, ctime_i}, where size_i is the estimated number of nodes in the cluster, and ctime_i is the cluster’s creation time. Thus, nodes from the old cluster will learn of this new cluster that has more nodes and the same diameter. This produces an avalanche effect as more and more nodes switch to the larger cluster.

Unfortunately, Coral can only count on a rough approximation of cluster size. If nearby clusters A and B are of similar sizes, inaccurate estimations could in the worst case cause oscillations as nodes flow back and forth. To perturb such oscillations into a stable state, Coral employs a preference function δ that shifts every hour. A node selects the larger cluster only if the following holds:

    log(size_A) − log(size_B) > δ(min(age_A, age_B))

where age is the current time minus ctime. Otherwise, a node simply selects the cluster with the lower cid.

We use a square wave function for δ that takes a value 0 on an even number of hours and 2 on an odd number. For clusters of disproportionate size, the selection function immediately favors the larger cluster. However, should clusters of similar size continuously exchange members when δ is zero, as soon as δ transitions, nodes will all flow to the cluster with the lower cid. Should the clusters oscillate when δ = 2, the one 2²-times larger will get all members when δ returns to zero.
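The merge preference rule can be transcribed almost directly. In the sketch below, the dictionary layout for cluster records and the use of base-2 logarithms are assumptions (the text specifies only the inequality, the hourly square wave for δ, and the tie-break on the lower cid); with base 2, δ = 2 corresponds to requiring a 4x size gap.

    import math

    def delta(age_hours):
        """Square-wave preference: 0 on an even number of hours, 2 on an odd one."""
        return 0 if int(age_hours) % 2 == 0 else 2

    def prefer_cluster(a, b, now_hours):
        """Choose between two acceptable same-level clusters a and b, each a
        dict with 'cid', 'size' (estimated), and 'ctime' (in hours).  Join the
        clearly larger cluster; otherwise fall back to the lower cid."""
        gap = delta(min(now_hours - a["ctime"], now_hours - b["ctime"]))
        if math.log2(a["size"]) - math.log2(b["size"]) > gap:
            return a
        if math.log2(b["size"]) - math.log2(a["size"]) > gap:
            return b
        return a if a["cid"] < b["cid"] else b

    # Example: the younger cluster is 9 hours old (odd), so delta = 2 and a
    # cluster must be more than 4x larger to win outright; a 5x gap clears it.
    old = {"cid": 7, "size": 120, "ctime": 0}
    new = {"cid": 9, "size": 600, "ctime": 4}
    chosen = prefer_cluster(old, new, now_hours=13)    # -> new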
2.5 Splitting clusters

In order to remain acceptable to its nodes, a cluster may eventually need to split. This event may result from a network partition or from population over-expansion, as new nodes may push the RTT threshold.
Coral’s split operation again incorporates some preferred direction of flow. If nodes merely atomized and randomly re-merged ... est to key cid_N in the DSHT. However, nodes cannot ... node joins cid_i based on its RTT with the center, and it performs a put(cid_i, nodeaddr) on its old cluster and its higher-level DSHTs.

One concern is that an early-adopter may move into a small successor cluster. However, before it left its previous level-i cluster, the latency within this cluster was approaching that of the larger level-(i−1) cluster. Thus, the node actually gains little benefit from maintaining membership in the smaller lower-level cluster.

As more nodes transition, their gets begin to hit the sloppy replicas of cid_N and cid_F: They learn a random subset of the nodes already split off into the two new clusters. Any node that finds cluster cid_N acceptable will join it, without having needed to ping the old cluster center. Nodes that do not find cid_N acceptable will attempt to join cluster cid_F. However, cluster cid_F could be even worse than the previous cluster, in which case it will split again. Except in the case of pathological network topologies, a small number of splits should suffice to reach a stable state. (Otherwise, after some maximum number of unsuccessful splits, a node could simply form a new cluster with a random ID as before.)

3 Measurements

Coral assigns system-wide RTT thresholds to the different levels of clusters. If nodes otherwise choose their own “acceptability” levels, clusters would experience greater instability as individual thresholds differ. Also, a cluster would not experience a distinct merging or splitting period that helps to return it to an acceptable, stable state. Can we find sensible system-wide parameters?

[Figure 2: CDFs of round-trip times between specified RON nodes and Gnutella peers (panels for NYU and Nortel; round trip time in msec, 0 to 300).]

To measure network distances in a deployed system, we performed latency experiments on the Gnutella network. We collected host addresses while acting as a Gnutella peer, and then measured the RTT between 12 RON nodes and approximately 2000 of these Gnutella peers. Both operations lasted for 24 hours. We determined round-trip times by attempting to open several TCP connections to high ports and measuring the minimum time elapsed between the SYN and RST packets.

Figure 2 shows the cumulative distribution function (CDF) of the measured RTTs between Gnutella hosts and the following RON sites: New York University (NYU); Nortel Networks, Montreal (Nortel); Intel Research, Berkeley (Intel); KAIST, Daejon (South Korea); Vrije University (Amsterdam); and NTUA (Athens).

If the CDFs had multiple “plateaus” at different RTTs, system-wide thresholds would not be ideal. A threshold chosen to fall within the plateau of some set of nodes sets the cluster’s most natural size. However, this threshold could bisect the rising edge of other nodes’ CDFs and yield greater instability for them.

Instead, our measurements show that the CDF curves are rather smooth. Therefore, we have relative freedom in setting cluster thresholds to ensure that each level of cluster in a particular region can capture some expected percentage of nearby nodes.
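To make the role of these system-wide thresholds concrete, the following sketch shows how a node might choose the deepest acceptable cluster level from a handful of RTT samples, using the level-1 and level-2 thresholds quoted in the next paragraph (100 msec and 30 msec). The three-level layout with an unbounded level-0 cluster, and the function itself, are assumptions for illustration.

    # Level 1 and level 2 thresholds are the values quoted below (100 msec and
    # 30 msec); an unbounded global level 0 is an assumption of this sketch.
    LEVEL_RTT_MS = {0: float("inf"), 1: 100.0, 2: 30.0}

    def deepest_acceptable_level(rtt_samples_ms):
        """Return the highest (smallest-diameter) level whose RTT bound is met
        by every sampled member of a candidate cluster."""
        worst = max(rtt_samples_ms)
        for level in sorted(LEVEL_RTT_MS, reverse=True):   # try 2, then 1, then 0
            if worst <= LEVEL_RTT_MS[level]:
                return level
        return 0

    # Example: samples of 12, 18, and 26 msec all fall under the 30 msec bound,
    # so the node may sit in a level-2 cluster with these peers.
    level = deepest_acceptable_level([12.0, 18.0, 26.0])    # -> 2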
Our choice of 30 msec for level-2 covers smaller clusters of nodes, while the level-1 threshold of 100 msec spans continents. For example, the expected RTT between New York and Berkeley is 68 msec, and 72 msec between Amsterdam and Athens. The curves in Figure 2 suggest that most Gnutella peers reside in North America. Thus, low-level clusters are especially useful for sparse regions like Korea, where most queries of a traditional peer-to-peer system would go to North America.

4 Related work

Several projects have recently considered peer-to-peer systems for web traffic. Stading et al. [10] use a DHT to cache replicas, and PROOFS [11] uses a randomized overlay to distribute popular content. However, both systems focus on mitigating flash crowds, not on normal web caching. Therefore, they accept higher lookup costs to prevent hot spots. Squirrel [4] proposed web caching on a traditional DHT, although only for LANs. It examines storing pointers in the DHT, yet reports poor load-balancing. We attribute this result to the limited number of pointers stored (only 4), which perhaps is due to the lack of any sloppiness in the system’s DHT interface. SCAN [1] examined replication policies for data disseminated through a multicast tree from a DHT deployed at ISPs.

5 Conclusions

Coral introduces the following techniques to enable distance-optimized object lookup and retrieval. First, Coral provides a DSHT abstraction. Instead of storing actual data, the system stores weakly-consistent lists of pointers that index nodes at which the data resides. Second, Coral assigns round-trip-time thresholds to clusters to bound cluster diameter and ensure fast lookups. Third, Coral nodes maintain the same identifier in all clusters. Thus, even when a low-diameter lookup fails, Coral uses the returned routing information to continue the query efficiently in a larger-diameter cluster. Finally, Coral provides an algorithm for self-organizing merging and splitting to ensure acceptable cluster diameters.

Coral is a promising design for performance-driven applications. We are in the process of building Coral and planning network-wide measurements to examine the effectiveness of its hierarchical DSHT design.

Acknowledgments

We thank David Andersen for access to the RON testbed, and Vijay Karamcheti, Eric Freudenthal, Robert Grimm, Sameer Ajmani, and Rodrigo Rodrigues for helpful comments. This research was conducted as part of the IRIS project (http://project-iris.net/), supported by the NSF under Cooperative Agreement No. ANI-0225660. Michael Freedman was supported by the ONR under an NDSEG Fellowship. This paper is hereby placed in the public domain.

References

[1] Yan Chen, Randy H. Katz, and John D. Kubiatowicz. SCAN: A dynamic, scalable, and efficient content distribution network. In Proceedings of the International Conference on Pervasive Computing, Zurich, Switzerland, August 2002.

[2] Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), Banff, Canada, 2001.

[3] John Kubiatowicz et al. OceanStore: An architecture for global-scale persistent storage. In Proc. ASPLOS, Cambridge, MA, November 2000.

[4] Sitaram Iyer, Antony Rowstron, and Peter Druschel. Squirrel: A decentralized, peer-to-peer web cache. In Principles of Distributed Computing (PODC), Monterey, CA, July 2002.

[5] Petar Maymounkov and David Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS02), Cambridge, MA, March 2002.

[6] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In Proc. ACM SIGCOMM, San Diego, CA, August 2001.

[7] Antony Rowstron and Peter Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. IFIP/ACM Middleware, November 2001.

[8] Antony Rowstron and Peter Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proc. 18th ACM Symposium on Operating Systems Principles (SOSP '01), Banff, Canada, 2001.

[9] Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. In Proc. ACM SIGCOMM Internet Measurement Workshop, Marseille, France, November 2002.

[10] Tyron Stading, Petros Maniatis, and Mary Baker. Peer-to-peer caching schemes to address flash crowds. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS02), Cambridge, MA, March 2002.

[11] Angelos Stavrou, Dan Rubenstein, and Sambit Sahu. A lightweight, robust p2p system to handle flash crowds. In IEEE International Conference on Network Protocols (ICNP), Paris, France, November 2002.

[12] Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 2002.

[13] Ben Zhao, John Kubiatowicz, and Anthony Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Division, U.C. Berkeley, April 2000.