
—  Professor Vana Kalogeraki

Department of Informatics
Athens University of Economics and Business

—  Topic 6: Structured P2P Systems - OceanStore

Distributed Systems 1
—  OceanStore: An architecture for global-scale persistent
storage
Kubiatowicz et al. [Berkeley]

—  http://oceanstore.cs.berkeley.edu

Distributed Systems 2
—  A global-scale utility infrastructure
—  Internet-based, distributed storage system for
information appliances (such as computers, PDAs,
cellular phones) with different levels of connectivity
—  Designed to support 10^10 users, each having 10^4 data
files (over 10^14 files in total)

Distributed Systems 3
—  Everyone’s data, one big utility
—  Allows data objects to exist anywhere, at any time

—  Uses automatic replication for disaster recovery


—  Also recovers from server and network failures

—  Achieves performance comparable to LAN-based
networked storage systems
—  Services would be provided by companies; users would
pay fees to utility providers for the storage and
bandwidth resources they consume

Distributed Systems 4
Distributed Systems 5
—  Built on a fundamentally untrusted infrastructure
—  Servers may crash without warning
—  Information can be leaked to third parties
—  Support for nomadic data
—  Data can be cached anywhere, anytime (promiscuous
caching)
—  Data is separated from its physical location

Distributed Systems 6
—  Naming
—  Access Control
—  Data Location and Routing
—  Local searches
—  Global searches
—  Data Replication

Distributed Systems 7
—  The fundamental unit is the persistent object
—  Objects identified by globally unique identifiers (GUIDs)
—  Pseudo-random fixed-length bit string
—  All operations operate on GUIDs
—  System-level names should help to authenticate data while still supporting a more
accessible naming facility
—  Objects are replicated and stored on multiple servers for
availability (floating replicas)
—  Two types of objects
—  Active object: latest version of the data
—  Archival object: permanent, read-only version

—  Objects are modified through user-generated updates (e.g.,
read/write operations)
—  The level of consistency can range from loose to strict semantics

Distributed Systems 8
Distributed Systems 9
—  Namespaces (nodes & objects)
—  An object's GUID is the secure hash (160-bit SHA-1) of the owner's
public key plus a human-readable name (see the sketch below)
—  160 bits -> roughly 2^80 names before a collision is expected
—  Based on the SHA-1 secure hash function
—  Certain objects act as directories mapping names to GUIDs
—  Each object has its own hierarchy, rooted at a directory named “root”
—  Properties
—  Uniqueness: GUID space selected by public key
—  Verification: check signatures with public key
—  Replicas of the same object have the same GUID
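As an illustration of this naming scheme (not OceanStore's actual code), a GUID of this form can be computed with a standard SHA-1 library; the key bytes and the object name below are made-up placeholders.

```python
import hashlib

def make_guid(owner_public_key: bytes, human_name: str) -> str:
    """Illustrative GUID: SHA-1 over the owner's public key plus the
    human-readable name, as described on this slide."""
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(human_name.encode("utf-8"))
    return h.hexdigest()  # 160-bit GUID as 40 hex digits

# Placeholder key: replicas named the same way by the same owner hash to the
# same GUID, and anyone holding the public key can recompute and verify it.
guid = make_guid(b"-----BEGIN PUBLIC KEY----- ...", "papers/oceanstore.pdf")
print(guid)
```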

Distributed Systems 10
—  Reader restriction
—  Encrypt all data that is not public
—  Distribute the encryption key to those users with read permissions
—  To revoke read permission, the owner must request that replicas be deleted or re-encrypted with a new key

—  Writer restriction
—  Access control list (ACL) for each object
—  Owner of the object keeps the ACL
—  All writes are signed and verified against the ACL (see the sketch below)
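A minimal sketch of the writer restriction, assuming a simplified scheme: per-writer HMAC secrets stand in for the public-key signatures that OceanStore actually verifies, and the writer names and keys are hypothetical.

```python
import hashlib
import hmac

# Hypothetical per-writer secrets; real OceanStore checks public-key signatures.
WRITER_KEYS = {"alice": b"alice-secret", "bob": b"bob-secret"}

class ProtectedObject:
    def __init__(self, owner: str, acl: set):
        self.owner = owner
        self.acl = acl              # writers allowed by the owner
        self.versions = []          # append-only update log

    def apply_update(self, writer: str, payload: bytes, signature: bytes) -> bool:
        """Accept an update only if the writer is on the ACL and the
        signature over the payload verifies."""
        if writer not in self.acl:
            return False
        expected = hmac.new(WRITER_KEYS[writer], payload, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, signature):
            return False
        self.versions.append(payload)
        return True

obj = ProtectedObject(owner="alice", acl={"alice"})
sig = hmac.new(WRITER_KEYS["alice"], b"v1", hashlib.sha256).digest()
assert obj.apply_update("alice", b"v1", sig)        # accepted
assert not obj.apply_update("bob", b"v1", sig)      # rejected: not on the ACL
```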

Distributed Systems 11
—  Naming
—  Access Control
—  Data Location and Routing
—  Local searches
—  Global searches
—  Data Replication

Distributed Systems 12
—  Objective: locate data quickly regardless of its location
—  In OceanStore, entities that are accessed frequently are likely to
reside close to where they are being used
—  Approach: combined location and routing
—  Two-level routing:
—  First, use a fast probabilistic algorithm to route to objects in the
local vicinity (attenuated Bloom filters)
—  Objects that are frequently accessed reside close to where they are used
—  If needed, use a slower but reliable large-scale hierarchical data
structure to locate remote objects (Tapestry)

Distributed Systems 13
—  Each node has a set of neighbors (peers)
—  The node associates with each neighbor a probability
of finding the object in the system through that
neighbor
—  If the query cannot be satisfied locally, route the query
to the most likely neighbor that can answer it
—  This function is implemented using attenuated Bloom
filters

Distributed Systems 14
—  An efficient, lossy way of describing sets of data
—  A Bloom filter represents a set of data using an array of bits; each bit
takes a binary one or zero value
—  A Bloom filter is a bit-vector of length m associated with a family of k
hash functions
—  Each hash function maps the elements of the set to an integer in [0, m)
—  To form a representation of a set, each set element is hashed and the bits
in the vector corresponding to the hash functions' results are set
[Figure: an element hashed by h1, h2, ..., hk sets the corresponding bits of the filter]
n: number of objects,
m: Bloom filter width
k: number of hash functions

Distributed Systems 15
—  To check if an object is in the set
—  Object is hashed
—  Corresponding bits of the filter are checked
—  If at least one of the bits is not set, object not in the set
—  If all bits are set, object may be in the set
—  All of the hashed bits may be set even though the object is not in the set
(a false positive)
—  The false-positive rate of a Bloom filter is a function of its width, the
number of hash functions, and the number of elements in the set
(a minimal sketch follows)
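To make the add and membership-check operations concrete, here is a minimal Python sketch (not code from OceanStore); deriving the k hash functions from salted SHA-1 and the parameters m and k below are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit-vector of length m with k hash functions."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, name: str):
        # Derive k positions in [0, m) by hashing the name with k salts.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{name}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, name: str) -> None:
        for p in self._positions(name):
            self.bits[p] = 1

    def may_contain(self, name: str) -> bool:
        # False means definitely absent; True means possibly present
        # (false positives are possible, false negatives are not).
        return all(self.bits[p] for p in self._positions(name))

bf = BloomFilter(m=64, k=3)
bf.add("object-A")
print(bf.may_contain("object-A"))   # True
print(bf.may_contain("object-B"))   # almost certainly False
```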

Distributed Systems 16
[Figure] A Bloom filter: to check an object's name against a Bloom filter
summary, the name is hashed with k different hash functions
(here, k = 3) and the bits corresponding to the results are checked
Distributed Systems 17
—  An attenuated Bloom filter of depth D is an array of D
normal Bloom filters
—  The first Bloom filter summarizes the objects stored
locally at the current node
—  The i-th Bloom filter is the union of the Bloom filters
of all nodes at distance i from the current node
—  One attenuated Bloom filter is kept for each neighbor
link (see the sketch below)
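Continuing the illustrative sketch above (still not OceanStore code), an attenuated Bloom filter is simply an array of D such filters, and a node keeps one per neighbor link; the depth and filter parameters are assumptions.

```python
class AttenuatedBloomFilter:
    """Depth-D array of Bloom filters; one such array is kept per neighbor link."""

    def __init__(self, depth: int, m: int = 64, k: int = 3):
        # levels[0] summarizes the objects at the node itself; levels[i] is the
        # union of the filters of all nodes i hops away through this link.
        self.levels = [BloomFilter(m, k) for _ in range(depth)]

    def first_matching_level(self, name: str):
        """Shallowest level whose filter may contain the object, else None."""
        for distance, bf in enumerate(self.levels):
            if bf.may_contain(name):
                return distance
        return None

# One attenuated filter per neighbor link, e.g.:
# neighbor_filters = {"n1": AttenuatedBloomFilter(depth=3), ...}
```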

Distributed Systems 18
—  The node examines the 1st level of each of its
neighbors’ filters
—  If matches are found, the query is forwarded to the
closest neighbor
—  If no matches are found, the node examines the next
level of each filter and forwards the query if a match is
found (a routing sketch follows)
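A possible routing decision, again as an illustrative sketch built on the classes above: among the neighbors whose attenuated filters match, pick the one that matches at the shallowest level (the tie-break by proximity to the closest neighbor is omitted here).

```python
from typing import Optional

def choose_next_hop(name: str, neighbor_filters: dict) -> Optional[str]:
    """neighbor_filters maps a neighbor id to its AttenuatedBloomFilter.
    Returns the chosen neighbor, or None to fall back to Tapestry routing."""
    best_neighbor, best_level = None, None
    for neighbor, afilter in neighbor_filters.items():
        level = afilter.first_matching_level(name)
        if level is not None and (best_level is None or level < best_level):
            best_neighbor, best_level = neighbor, level
    return best_neighbor
```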

Distributed Systems 19
[Figure: example of query routing with attenuated Bloom filters; a query issued at node n1 for an object is forwarded among nodes n1, n2, n3, and n4 (steps 1 through 5, with alternatives 4a and 4b) by matching the object's hashed bits against each node's local-object Bloom filter and the attenuated filters kept for its neighbor links]
Distributed Systems 20
—  When to update a Bloom Filter?
—  When new objects are stored or objects are deleted
—  The node calculates the changed bits in its bloom filter and
sends those to its neighbors
—  Each neighbor that receives such a message updates the bits
in the attenuated bloom filter and these changes are sent out
as well
—  The update may be propagated to some servers more
than once, wasting network bandwidth (a propagation sketch follows)
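A rough sketch of how such an update could propagate, assuming the classes above; the wire format and hop-count bookkeeping are assumptions rather than OceanStore's protocol, and for simplicity only newly set bits are handled (deletions are what motivate the counting filters on the next slide).

```python
def changed_bits(old_bits, new_bits):
    """Positions whose value changed after local inserts or deletes;
    only these positions need to be sent to neighbors."""
    return [i for i, (a, b) in enumerate(zip(old_bits, new_bits)) if a != b]

def apply_neighbor_update(afilter, hop_distance, positions):
    """A node hop_distance hops from the origin records the change at the
    corresponding level of the attenuated filter for that link, then
    re-forwards the update (possibly reaching some servers twice)."""
    if hop_distance < len(afilter.levels):
        for p in positions:
            afilter.levels[hop_distance].bits[p] = 1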

Distributed Systems 21
—  Counting Bloom Filters provide an efficient way to
implement the delete operation on the bloom filter
—  In a counting filter, each array position is extended from
a single bit to a counter that records how many inserted
elements hash to that position
—  The insert operation increments the corresponding counters
—  The delete operation decrements the corresponding counters (see the sketch below)
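A minimal counting-filter sketch in the same illustrative style as before (not code from the paper):

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter whose positions hold counters so that deletion is possible."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, name: str):
        for i in range(self.k):
            yield int(hashlib.sha1(f"{i}:{name}".encode()).hexdigest(), 16) % self.m

    def insert(self, name: str) -> None:
        for p in self._positions(name):
            self.counters[p] += 1

    def delete(self, name: str) -> None:
        for p in self._positions(name):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def may_contain(self, name: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(name))
```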

Distributed Systems 22
—  In a P2P system, messages can reach peers through
multiple paths


—  Employ filtering techniques
—  Destination filtering: peers remember the IDs of every
update they see for a short period and therefore ignore
subsequent arrivals of the same update through different
paths (see the sketch below)
—  Source filtering: when a node receives a duplicate
message from one of its neighbors, it sends a message
asking that neighbor to stop forwarding updates along that path
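A small sketch of destination filtering; the 30-second remembering window and the update-ID format are assumptions made for illustration.

```python
import time

class DestinationFilter:
    """Remember recently seen update IDs so that copies of the same update
    arriving over different paths are ignored."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self.seen = {}  # update id -> time first seen

    def should_process(self, update_id: str) -> bool:
        now = time.monotonic()
        # Forget entries older than the remembering window.
        self.seen = {u: t for u, t in self.seen.items() if now - t < self.ttl}
        if update_id in self.seen:
            return False   # duplicate delivered via another path
        self.seen[update_id] = now
        return True
```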

Distributed Systems 23
—  Uses Tapestry to build a reliable, large-scale hierarchical
data structure to locate remote objects
—  A query is routed from node to node until the location of
a replica is discovered

Distributed Systems 24
—  Each node is assigned a random and unique node ID
—  Node IDs are used to construct a mesh of neighbor links

Distributed Systems 25
—  Every node is connected to other nodes via neighbor links
of various levels
—  Level-1 edges connect to nodes with different values in the lowest
digit of their node IDs
—  Level-2 edges connect to nodes that match in the lowest digit and
have different values in the second digit
—  etc.

—  Each node has a neighbor map with multiple levels


—  Messages are routed to the destination digit by digit:
***8 -> **98 -> *598 -> 4598 (see the routing sketch below)
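A minimal sketch of one suffix-routing step, mirroring the ***8 -> **98 -> *598 -> 4598 example above; the routing table shown is a hypothetical stand-in for a Tapestry neighbor map level.

```python
def next_hop_suffix(current: str, destination: str, routing_table: dict) -> str:
    """Pick a neighbor that matches the destination in one more low-order
    (suffix) digit than the current node does. routing_table maps a required
    suffix to a known node with that suffix (an assumption for illustration)."""
    matched = 0
    while (matched < len(current) and matched < len(destination)
           and current[-(matched + 1)] == destination[-(matched + 1)]):
        matched += 1
    needed_suffix = destination[-(matched + 1):]
    return routing_table[needed_suffix]

# Hypothetical table at a node whose ID already matches the last digit:
table = {"98": "3E98"}
print(next_hop_suffix("B4F8", "4598", table))   # forwards toward a *98 node
```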

Distributed Systems 26
Distributed Systems 27
—  Incremental suffix-based routing
—  Each link is labeled with a level number that denotes
the stage of routing that uses the link
—  At the h-th hop the message arrives at the nearest node
hop(h): hop(h) shares a suffix of length h digits with the
destination B
[Figure: Tapestry mesh with links labeled L1 through L4, showing a potential path for a message originating at node 0325 and destined for node 4598]
Distributed Systems 28
—  Each object with a GUID is mapped to a root node id
—  Search walks toward root until the object or a pointer to the
object is located
—  OceanStore enhancements for reliability:
—  Each object is associated with a set of unique root nodes

—  To advertise an object, the node sends a message towards
the object's root, leaving pointers along the way

Distributed Systems 29
Distributed Systems 30
—  To locate the object, a node sends a message
towards the object's root until it finds a pointer, in
which case it routes directly to the object (see the sketch below)
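A conceptual sketch of publish and locate; the data structures and routes below are assumptions for illustration, not Tapestry's actual API. Publishing leaves a pointer at every node on the route toward the object's root, and a query walks its own route toward the same root and stops at the first pointer it finds.

```python
def publish(guid: str, server: str, route_to_root: list, pointer_tables: dict):
    """route_to_root: node ids from the storing server toward the object's
    root (Tapestry would compute this; here it is given as a plain list)."""
    for node in route_to_root:
        pointer_tables.setdefault(node, {})[guid] = server

def locate(guid: str, route_to_root: list, pointer_tables: dict):
    """route_to_root: node ids from the querying node toward the same root;
    the two routes converge, so a pointer is found at or before the root."""
    for node in route_to_root:
        if guid in pointer_tables.get(node, {}):
            return pointer_tables[node][guid]   # then route directly there
    return None

tables = {}
publish("obj-42", "serverA", ["n7", "n3", "root9"], tables)
print(locate("obj-42", ["n5", "n3", "root9"], tables))   # -> "serverA"
```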

Distributed Systems 31
—  Tapestry uses redundant neighbor pointers when it
detects a primary route failure
—  Uses periodic probes to check link conditions
—  Tapestry deterministically chooses multiple root nodes
for each object

Distributed Systems 32
—  Automatic repair
—  Node insertions:
—  A new node needs the address of at least one existing node
—  It then starts advertising its services and the roles it can
assume to the system through the existing node
—  Exiting nodes:
—  If possible, the exiting node runs a shutdown script to inform
the system
—  In any case, neighbors will detect its absence and update
routing tables accordingly

Distributed Systems 33
—  Active Data in Object Replicas
—  Latest version of the object
—  State logging for updates and conflict resolutions

—  Archival Objects
—  Permanent, read-only version of the object

Distributed Systems 34
—  Updates are made by clients and all updates are logged
—  OceanStore allows concurrent updates
—  Serializing updates:
—  Since the infrastructure is untrusted, using a master
replica will not work
—  Instead, a group of peers called the inner ring is responsible
for choosing the final commit order

Distributed Systems 35
—  An object’s inner ring is responsible for
—  Generating new versions of the object from client updates
—  Generating encoded archival fragments and distributing
them
—  Providing the mapping from the object's active GUID to the
GUID of its most recent version
—  Verifying the data object's legitimate writers
—  Maintaining an update history that provides an undo
mechanism

Distributed Systems 36
—  Three forms of data: the primary replica (inner ring),
secondary replicas, and archival fragments
[Figure: the path of an update through the primary replica (inner ring), the secondary replicas, and the archive, with times Treq, Tagree, and Tdisseminate]
Distributed Systems 37
—  OceanStore: An architecture for global-scale
persistent storage [Kubiatowicz et al., ASPLOS 2000]

—  Probabilistic location and routing [Rhea and
Kubiatowicz, INFOCOM 2002]

Distributed Systems 38
