Department of Informatics
Athens University of Economics and Business
OceanStore: An architecture for global-scale persistent storage
Kubiatowicz et al. [Berkeley]
http://oceanstore.cs.berkeley.edu
Distributed Systems 2
A global-scale utility infrastructure
Internet-based, distributed storage system for
information appliances (such as computers, PDAs, and
cellular phones) with differing levels of connectivity
Designed to support 10^10 users, each having 10^4 data
files (over 10^14 files in total)
Everyone’s data, one big utility
Allows data objects to exist anywhere, at any time
Built on a fundamentally untrusted infrastructure
Servers may crash without warning
Information can be leaked to third parties
Support for nomadic data
Data can be cached anywhere, anytime (promiscuous
caching)
Data is separated from its physical location
Naming
Access Control
Data Location and Routing
Local searches
Global searches
Data Replication
The fundamental unit is the persistent object
Objects identified by globally unique identifiers (GUIDs)
Pseudo-random fixed-length bit string
All operations operate on GUIDs
System-level names are self-verifying and help to authenticate data; a more
accessible naming facility is layered on top of them
Objects are replicated and stored on multiple servers for
availability (floating replicas)
Two types of objects
Active object: latest version of the data
Archival object: permanent, read-only version
Namespaces (nodes & objects)
An object GUID is the secure hash (160-bit SHA-1) of the owner's
public key + a human-readable name
160 bits -> about 2^80 names before a collision is expected (birthday bound)
Based on the SHA-1 secure hash function
Certain objects act as directories mapping names to GUIDs
Each object has its own naming hierarchy rooted at its "root" directory
Properties
Uniqueness: GUID space selected by public key
Verification: check signatures with public key
Replicas of the same object have the same GUID
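The GUID derivation above can be sketched in a few lines. This is a minimal illustration, assuming the owner's public key is available as raw bytes and that the key and name are simply concatenated before hashing; the function name is hypothetical.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> str:
    """Derive a 160-bit object GUID as SHA-1(owner's public key + human-readable name)."""
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()  # 40 hex digits = 160 bits

# Replicas of the same object derive the same GUID,
# and anyone holding the public key can verify the name binding:
guid1 = object_guid(b"-----BEGIN PUBLIC KEY-----...", "photos/cat.jpg")
guid2 = object_guid(b"-----BEGIN PUBLIC KEY-----...", "photos/cat.jpg")
assert guid1 == guid2 and len(guid1) == 40
```

Because the GUID space is selected by the public key, two owners using the same human-readable name still obtain distinct GUIDs.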
Reader restriction
Encrypt all data that is not public
Distribute the encryption key to those users with read permission
To revoke read permission, the data must be re-encrypted with a new key
and all replicas encrypted under the old key must be deleted
Naming
Access Control
Data Location and Routing
Local searches
Global searches
Data Replication
Objective: locate data quickly regardless of its location
In OceanStore, entities that are accessed frequently are likely to
reside close to where they are being used
Approach: combined location and routing
Two-level routing:
First, use a fast probabilistic algorithm to route to objects in the
local vicinity (attenuated bloom filters)
Objects that are frequently accessed reside close to where they are used
If needed, use a slower but reliable large-scale hierarchical data
structure to locate remote objects (tapestry)
Each node has a set of neighbors (peers)
The node associates with each neighbor a probability
of finding the object in the system through that
neighbor
If the query cannot be satisfied locally, route the query
to the most likely neighbor that can answer it
This function is implemented using attenuated Bloom
filters
An efficient, lossy way of describing sets of data
A Bloom filter represents a set as an array of bits: a bit-vector
of length m associated with a family of k hash functions
Each hash function maps the elements of the set to an integer in [0, m)
To form a representation of a set, each set element is hashed and the
bits in the vector corresponding to the hash functions' results are set
n: number of objects, m: Bloom filter width, k: number of hash functions
(Figure: hash functions h1 ... hk each mapping an element to one bit of the m-bit vector)
To check if an object is in the set
Object is hashed
Corresponding bits of the filter are checked
If at least one of the bits is not set, object not in the set
If all bits are set, object may be in the set
Even if all the hashed bits are set, the object may not actually be
in the set (a false positive)
The false-positive rate of a Bloom filter is a function of its width m,
the number of hash functions k, and the number of objects n in the set
A Bloom filter: to check an object's name against a Bloom filter
summary, the name is hashed with k different hash functions
(here, k = 3) and the bits corresponding to the results are checked
An attenuated Bloom filter of depth D is an array of D
normal Bloom filters
The first Bloom filter summarizes the objects stored
locally at the current node
The i-th Bloom filter is the union of the Bloom filters
of all nodes at distance i from the current node
One attenuated Bloom filter is kept for each neighbor link
The node examines the 1st level of each of its
neighbors’ filters
If matches are found, the query is forwarded to the
closest neighbor
If no matches are found, the node examines the next
level of each filter and forwards the query if a match is
found
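The level-by-level forwarding decision above can be sketched as follows. This is a simplified illustration, assuming each neighbor's attenuated filter is modeled as a list of plain Python sets rather than real Bloom filters, and that neighbors are kept in order of network closeness; all names are hypothetical.

```python
def choose_next_hop(query: str, neighbors: dict):
    """neighbors maps neighbor name -> attenuated filter of depth D,
    modeled as a list of sets (level 0 = objects at the neighbor itself,
    level i = objects i hops beyond it). Returns the first neighbor whose
    shallowest matching level may hold the object, or (None, None)."""
    depth = max(len(f) for f in neighbors.values())
    for level in range(depth):                  # examine shallower levels first
        for name, levels in neighbors.items():  # neighbors in closeness order
            if level < len(levels) and query in levels[level]:
                return name, level
    return None, None  # no local match: fall back to global (Tapestry) routing

neighbors = {
    "n1": [{"a"}, {"b", "c"}],
    "n2": [{"d"}, {"b"}],
}
assert choose_next_hop("b", neighbors) == ("n1", 1)  # found at level 1 via n1
assert choose_next_hop("z", neighbors) == (None, None)
```

Checking shallower levels first biases routing toward nearby copies, matching the design goal that frequently accessed objects reside close to where they are used.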
(Figure: attenuated Bloom filter routing example. A query entering at n1
for an object that hashes to bits 11010 is forwarded step by step toward
n3, whose filter matches; the branch toward n4, whose filter 00011 does
not match, is not taken. Each node stores a filter of its local objects
plus attenuated Bloom filters for its neighbor links.)
When to update a Bloom Filter?
When new objects are stored or objects are deleted
The node calculates the changed bits in its bloom filter and
sends those to its neighbors
Each neighbor that receives such a message updates the bits
in the attenuated bloom filter and these changes are sent out
as well
The update may be propagated to some servers more
than once, wasting network bandwidth
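Computing "the changed bits" above amounts to XOR-ing the old and new bit-vectors. A minimal sketch, assuming the filter is stored as a Python integer bit-vector as in a simple implementation; the function name is hypothetical.

```python
def changed_bits(old: int, new: int):
    """Positions where the Bloom filter changed between two versions;
    only these positions need to be pushed to neighbors."""
    diff = old ^ new  # XOR leaves a 1 exactly where a bit flipped
    return [i for i in range(diff.bit_length()) if diff >> i & 1]

# e.g. storing a new object set bits 2 and 5 of the local filter:
assert changed_bits(0b000001, 0b100101) == [2, 5]
assert changed_bits(0b1010, 0b1010) == []  # no change, nothing to send
```

Sending only the flipped positions keeps update messages small, though, as noted above, the same delta may still reach a server over several paths.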
Counting Bloom Filters provide an efficient way to
implement the delete operation on the bloom filter
In a counting filter, each array position is extended from a single
bit to a counter that records how many set elements have hashed to
that position.
The insert operation increments the value of the counter
The delete operation decrements the value of the counter
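The counting variant can be sketched by replacing the bit-vector with an array of counters. A minimal illustration under the same salted-SHA-1 assumption as before; the class name is hypothetical.

```python
import hashlib

class CountingBloomFilter:
    """Each position holds a counter instead of a single bit, enabling delete."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, obj: str):
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{obj}".encode()).digest()
            yield int.from_bytes(d, "big") % self.m

    def insert(self, obj: str):
        for p in self._positions(obj):
            self.counters[p] += 1   # insert increments the counters

    def delete(self, obj: str):
        for p in self._positions(obj):
            if self.counters[p] > 0:
                self.counters[p] -= 1  # delete decrements the counters

    def __contains__(self, obj: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(obj))

cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("objA")
assert "objA" in cbf
cbf.delete("objA")
assert "objA" not in cbf  # safe delete: no shared bits are erased
```

The ordinary Bloom filter advertised to neighbors can be derived from the counters (position set iff counter > 0), so deletes never corrupt other objects' entries.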
In a P2P system, messages can reach peers through
multiple paths
…
Employ filtering techniques
Destination filtering: peers remember the ids of every
update they see for a short period, therefore ignore
subsequent arrivals of the same update through different
paths
Source filtering: when a node receives a duplicate
message from one of its neighbors, it tells that
neighbor to stop forwarding updates along that path
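Destination filtering as described above can be sketched with a table of recently seen update ids that expire after a short period. A simplified single-node illustration; the class name and TTL value are hypothetical.

```python
import time

class DestinationFilter:
    """Remember recently seen update ids for ttl seconds, so that
    duplicates arriving over different paths can be ignored."""
    def __init__(self, ttl: float):
        self.ttl, self.seen = ttl, {}

    def should_process(self, update_id: str) -> bool:
        now = time.monotonic()
        # Purge entries older than the ttl (the "short period").
        self.seen = {u: t for u, t in self.seen.items() if now - t < self.ttl}
        if update_id in self.seen:
            return False          # duplicate via another path: ignore
        self.seen[update_id] = now
        return True               # first arrival: process and remember

df = DestinationFilter(ttl=30.0)
assert df.should_process("update-17")      # first arrival is processed
assert not df.should_process("update-17")  # later duplicate is dropped
```

The TTL trades memory against correctness: a duplicate arriving after the id has expired is processed again, so the period must exceed the longest plausible path delay.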
Uses Tapestry to build a reliable, large-scale hierarchical
data structure to locate remote objects
A query is routed from node to node until the location of
a replica is discovered
Each node is assigned a random and unique
node-id
Every node is connected to other nodes via neighbor links
of various levels
Level-1 edges connect to nodes with different values in the lowest
digit of their node IDs
Level-2 edges connect to nodes that match in the lowest digit and
have different values in the second digit
etc.
Incremental suffix-based routing
Each link is labeled with a level number that denotes the stage of
routing that uses the link
At the h-th hop, the message arrives at the nearest node hop(h):
hop(h) shares a suffix of length h digits with the destination B
(Figure: example mesh of nodes 0325, B4F8, 3E98, 0098, 9098, 2BB8,
1598, 4598, 7598, D598, 87CA, 2118 with links labeled L1 through L4;
each hop toward 4598 matches one more suffix digit:
xxx8 -> xx98 -> x598 -> 4598)
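The per-hop decision above can be sketched as: among known nodes whose id suffix matches the destination in one more digit than the current hop, choose the nearest. This is a simplified illustration using the node ids from the figure, not Tapestry's actual neighbor-table layout; the routing table modeled as an id-to-distance map is an assumption.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits two node ids share."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(dest: str, hop: int, routing_table: dict):
    """At hop h, pick a known node whose suffix matches dest in at least
    h+1 digits; routing_table maps node id -> network distance, and ties
    are broken by choosing the nearest such node."""
    candidates = [n for n in routing_table
                  if shared_suffix_len(n, dest) >= hop + 1]
    return min(candidates, key=routing_table.get) if candidates else None

table = {"0098": 3, "7598": 5, "4598": 9, "2BB8": 1}
assert next_hop("4598", 1, table) == "0098"  # matches suffix "98", nearest
assert next_hop("4598", 2, table) == "7598"  # matches suffix "598"
assert next_hop("4598", 3, table) == "4598"  # full match: destination
```

Because each hop fixes one more suffix digit, a route over b-digit ids terminates in at most b hops regardless of where it starts.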
To locate the object, the node sends a message
towards the object’s root, until it finds a pointer in
which case it routes directly to the object
Tapestry uses redundant neighbor pointers when it
detects a primary route failure
Uses periodic probes to check link conditions
Tapestry deterministically chooses multiple root nodes
for each object
Automatic repair
Node insertions:
A new node needs the address of at least one existing node
It then starts advertising its services and the roles it can
assume to the system through the existing node
Exiting nodes:
If possible, the exiting node runs a shutdown script to inform
the system
In any case, neighbors will detect its absence and update
routing tables accordingly
Active Data in Object Replicas
Latest version of the object
State logging for updates and conflict resolution
Updates are made by clients and all updates are logged
OceanStore allows concurrent updates
Serializing updates:
Since the infrastructure is untrusted, using a master
replica will not work
Instead, a group of peers called inner ring is responsible
for choosing final commit order
An object's inner ring is responsible for:
Generate new versions of an object from client updates
Generate encoded, archival fragments and distribute
them
Provide the mapping from the object's active GUID to the
GUID of its most recent version
Verify a data object’s legitimate writers
Maintain an update history providing an undo
mechanism
Three forms of data: object replicas, secondary
replicas and archival fragments
(Figure: the primary replica (inner ring) feeds updates to the
secondary replicas and to the archive)