Department of Informatics
Athens University of Economics and Business
OceanStore: An architecture for global-scale persistent storage
Kubiatowicz et al. [Berkeley]
http://oceanstore.cs.berkeley.edu
Distributed Systems 2
A global-scale utility infrastructure
Internet-based, distributed storage system for
information appliances (such as computers, PDAs, and
cellular phones) with differing levels of connectivity
Designed to support 10^10 users, each having 10^4 data
files (over 10^14 files in total)
Everyone’s data, one big utility
Allows data objects to exist anywhere, at any time
Built on a fundamentally untrusted infrastructure
Servers may crash without warning
Information can be leaked to third parties
Support for nomadic data
Data can be cached anywhere, anytime (promiscuous
caching)
Data is separated from its physical location
Naming
Access Control
Data Location and Routing
Local searches
Global searches
Data Replication
The fundamental unit is the persistent object
Objects identified by globally unique identifiers (GUIDs)
Pseudo-random fixed-length bit string
All operations operate on GUIDs
System-level names are self-verifying and help to authenticate data; a more
accessible naming facility is layered on top of them
Objects are replicated and stored on multiple servers for
availability (floating replicas)
Two types of objects
Active object: latest version of the data
Archival object: permanent, read-only version
Namespaces (nodes & objects)
An object GUID is the secure hash (160-bit SHA-1) of the owner's
public key + a human-readable name
160 bits -> about 2^80 names before a collision is expected (birthday bound)
Based on the SHA-1 secure hash function
Certain objects act as directories mapping names to GUIDs
Each object has its own naming hierarchy rooted at its "root" directory
Properties
Uniqueness: GUID space selected by public key
Verification: check signatures with public key
Replicas of the same object have the same GUID
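The GUID derivation above can be sketched in a few lines. This is a minimal illustration, assuming the owner's public key is available as raw bytes and that the key and name are simply concatenated before hashing; the function name is hypothetical.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> str:
    """Derive a 160-bit object GUID as SHA-1(owner's public key + human-readable name)."""
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()  # 40 hex digits = 160 bits

# Replicas of the same object derive the same GUID,
# and anyone holding the public key can verify the name binding:
guid1 = object_guid(b"-----BEGIN PUBLIC KEY-----...", "photos/cat.jpg")
guid2 = object_guid(b"-----BEGIN PUBLIC KEY-----...", "photos/cat.jpg")
assert guid1 == guid2 and len(guid1) == 40
```

Because the GUID space is selected by the public key, two owners using the same human-readable name still obtain distinct GUIDs.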
Reader restriction
Encrypt all data that is not public
Distribute the encryption key to those users with read permission
To revoke read permission, the data must be re-encrypted with a new key
and all replicas encrypted under the old key must be deleted
Naming
Access Control
Data Location and Routing
Local searches
Global searches
Data Replication
Objective: locate data quickly regardless of its location
In OceanStore, entities that are accessed frequently are likely to
reside close to where they are being used
Approach: combined location and routing
Two-level routing:
First, use a fast probabilistic algorithm to route to objects in the
local vicinity (attenuated bloom filters)
Objects that are frequently accessed reside close to where they are used
If needed, use a slower but reliable large-scale hierarchical data
structure to locate remote objects (tapestry)
Each node has a set of neighbors (peers)
The node associates with each neighbor a probability
of finding the object in the system through that
neighbor
If the query cannot be satisfied locally, route the query
to the most likely neighbor that can answer it
This function is implemented using attenuated Bloom
filters
An efficient, lossy way of describing sets of data
A Bloom filter represents a set as an array of bits: a bit-vector
of length m associated with a family of k hash functions
Each hash function maps the elements of the set to an integer in [0, m)
To form a representation of a set, each set element is hashed and the
bits in the vector corresponding to the hash functions' results are set
n: number of objects, m: Bloom filter width, k: number of hash functions
(Figure: hash functions h1 ... hk each mapping an element to one bit of the m-bit vector)
To check if an object is in the set
Object is hashed
Corresponding bits of the filter are checked
If at least one of the bits is not set, object not in the set
If all bits are set, object may be in the set
Even if all the hashed bits are set, the object may not actually be
in the set (a false positive)
The false-positive rate of a Bloom filter is a function of its width m,
the number of hash functions k, and the number of objects n in the set
A Bloom filter: to check an object's name against a Bloom filter
summary, the name is hashed with k different hash functions
(here, k = 3) and the bits corresponding to the results are checked
An attenuated Bloom filter of depth D is an array of D
normal Bloom filters
The first Bloom filter summarizes the objects stored
locally at the current node
The i-th Bloom filter is the union of the Bloom filters
of all nodes at distance i from the current node
One attenuated Bloom filter is kept for each neighbor link
The node examines the 1st level of each of its
neighbors’ filters
If matches are found, the query is forwarded to the
closest neighbor
If no matches are found, the node examines the next
level of each filter and forwards the query if a match is
found
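The level-by-level forwarding decision above can be sketched as follows. This is a simplified illustration, assuming each neighbor's attenuated filter is modeled as a list of plain Python sets rather than real Bloom filters, and that neighbors are kept in order of network closeness; all names are hypothetical.

```python
def choose_next_hop(query: str, neighbors: dict):
    """neighbors maps neighbor name -> attenuated filter of depth D,
    modeled as a list of sets (level 0 = objects at the neighbor itself,
    level i = objects i hops beyond it). Returns the first neighbor whose
    shallowest matching level may hold the object, or (None, None)."""
    depth = max(len(f) for f in neighbors.values())
    for level in range(depth):                  # examine shallower levels first
        for name, levels in neighbors.items():  # neighbors in closeness order
            if level < len(levels) and query in levels[level]:
                return name, level
    return None, None  # no local match: fall back to global (Tapestry) routing

neighbors = {
    "n1": [{"a"}, {"b", "c"}],
    "n2": [{"d"}, {"b"}],
}
assert choose_next_hop("b", neighbors) == ("n1", 1)  # found at level 1 via n1
assert choose_next_hop("z", neighbors) == (None, None)
```

Checking shallower levels first biases routing toward nearby copies, matching the design goal that frequently accessed objects reside close to where they are used.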
(Figure: attenuated Bloom filter routing example. A query entering at n1
for an object that hashes to bits 11010 is forwarded step by step toward
n3, whose filter matches; the branch toward n4, whose filter 00011 does
not match, is not taken. Each node stores a filter of its local objects
plus attenuated Bloom filters for its neighbor links.)
When to update a Bloom Filter?
When new objects are stored or objects are deleted
The node calculates the changed bits in its bloom filter and
sends those to its neighbors
Each neighbor that receives such a message updates the bits
in the attenuated bloom filter and these changes are sent out
as well
The update may be propagated to some servers more
than once, wasting network bandwidth
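Computing "the changed bits" above amounts to XOR-ing the old and new bit-vectors. A minimal sketch, assuming the filter is stored as a Python integer bit-vector as in a simple implementation; the function name is hypothetical.

```python
def changed_bits(old: int, new: int):
    """Positions where the Bloom filter changed between two versions;
    only these positions need to be pushed to neighbors."""
    diff = old ^ new  # XOR leaves a 1 exactly where a bit flipped
    return [i for i in range(diff.bit_length()) if diff >> i & 1]

# e.g. storing a new object set bits 2 and 5 of the local filter:
assert changed_bits(0b000001, 0b100101) == [2, 5]
assert changed_bits(0b1010, 0b1010) == []  # no change, nothing to send
```

Sending only the flipped positions keeps update messages small, though, as noted above, the same delta may still reach a server over several paths.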
Counting Bloom Filters provide an efficient way to
implement the delete operation on the bloom filter
In a counting filter, each array position is extended from a single
bit to a counter that records how many set elements have hashed to
that position.
The insert operation increments the value of the counter
The delete operation decrements the value of the counter
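The counting variant can be sketched by replacing the bit-vector with an array of counters. A minimal illustration under the same salted-SHA-1 assumption as before; the class name is hypothetical.

```python
import hashlib

class CountingBloomFilter:
    """Each position holds a counter instead of a single bit, enabling delete."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, obj: str):
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{obj}".encode()).digest()
            yield int.from_bytes(d, "big") % self.m

    def insert(self, obj: str):
        for p in self._positions(obj):
            self.counters[p] += 1   # insert increments the counters

    def delete(self, obj: str):
        for p in self._positions(obj):
            if self.counters[p] > 0:
                self.counters[p] -= 1  # delete decrements the counters

    def __contains__(self, obj: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(obj))

cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("objA")
assert "objA" in cbf
cbf.delete("objA")
assert "objA" not in cbf  # safe delete: no shared bits are erased
```

The ordinary Bloom filter advertised to neighbors can be derived from the counters (position set iff counter > 0), so deletes never corrupt other objects' entries.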
In a P2P system, messages can reach peers through
multiple paths
…
Employ filtering techniques
Destination filtering: peers remember the ids of every
update they see for a short period, therefore ignore
subsequent arrivals of the same update through different
paths
Source filtering: when a node receives a duplicate
message from one of its neighbors, it tells that
neighbor to stop forwarding updates along that path
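Destination filtering as described above can be sketched with a table of recently seen update ids that expire after a short period. A simplified single-node illustration; the class name and TTL value are hypothetical.

```python
import time

class DestinationFilter:
    """Remember recently seen update ids for ttl seconds, so that
    duplicates arriving over different paths can be ignored."""
    def __init__(self, ttl: float):
        self.ttl, self.seen = ttl, {}

    def should_process(self, update_id: str) -> bool:
        now = time.monotonic()
        # Purge entries older than the ttl (the "short period").
        self.seen = {u: t for u, t in self.seen.items() if now - t < self.ttl}
        if update_id in self.seen:
            return False          # duplicate via another path: ignore
        self.seen[update_id] = now
        return True               # first arrival: process and remember

df = DestinationFilter(ttl=30.0)
assert df.should_process("update-17")      # first arrival is processed
assert not df.should_process("update-17")  # later duplicate is dropped
```

The TTL trades memory against correctness: a duplicate arriving after the id has expired is processed again, so the period must exceed the longest plausible path delay.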
Uses Tapestry to build a reliable, large-scale hierarchical
data structure to locate remote objects
A query is routed from node to node until the location of
a replica is discovered
Each node is assigned a random and unique
node-id
Every node is connected to other nodes via neighbor links
of various levels
Level-1 edges connect to nodes with different values in the lowest
digit of their node IDs
Level-2 edges connect to nodes that match in the lowest digit and
have different values in the second digit
etc.
Incremental suffix-based routing
Each link is labeled with a level number that denotes the stage of
routing that uses the link
At the h-th hop, the message arrives at the nearest node hop(h):
hop(h) shares a suffix of length h digits with the destination B
(Figure: example mesh of nodes 0325, B4F8, 3E98, 0098, 9098, 2BB8,
1598, 4598, 7598, D598, 87CA, 2118 with links labeled L1 through L4;
each hop toward 4598 matches one more suffix digit:
xxx8 -> xx98 -> x598 -> 4598)
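The per-hop decision above can be sketched as: among known nodes whose id suffix matches the destination in one more digit than the current hop, choose the nearest. This is a simplified illustration using the node ids from the figure, not Tapestry's actual neighbor-table layout; the routing table modeled as an id-to-distance map is an assumption.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits two node ids share."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(dest: str, hop: int, routing_table: dict):
    """At hop h, pick a known node whose suffix matches dest in at least
    h+1 digits; routing_table maps node id -> network distance, and ties
    are broken by choosing the nearest such node."""
    candidates = [n for n in routing_table
                  if shared_suffix_len(n, dest) >= hop + 1]
    return min(candidates, key=routing_table.get) if candidates else None

table = {"0098": 3, "7598": 5, "4598": 9, "2BB8": 1}
assert next_hop("4598", 1, table) == "0098"  # matches suffix "98", nearest
assert next_hop("4598", 2, table) == "7598"  # matches suffix "598"
assert next_hop("4598", 3, table) == "4598"  # full match: destination
```

Because each hop fixes one more suffix digit, a route over b-digit ids terminates in at most b hops regardless of where it starts.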
To locate the object, the node sends a message
towards the object’s root, until it finds a pointer in
which case it routes directly to the object
Tapestry uses redundant neighbor pointers when it
detects a primary route failure
Uses periodic probes to check link conditions
Tapestry deterministically chooses multiple root nodes
for each object
Automatic repair
Node insertions:
A new node needs the address of at least one existing node
It then starts advertising its services and the roles it can
assume to the system through the existing node
Exiting nodes:
If possible, the exiting node runs a shutdown script to inform
the system
In any case, neighbors will detect its absence and update
routing tables accordingly
Active Data in Object Replicas
Latest version of the object
State logging for updates and conflict resolution
Updates are made by clients and all updates are logged
OceanStore allows concurrent updates
Serializing updates:
Since the infrastructure is untrusted, using a master
replica will not work
Instead, a group of peers called inner ring is responsible
for choosing final commit order
An object's inner ring is responsible for:
Generate new versions of an object from client updates
Generate encoded, archival fragments and distribute
them
Provide the mapping from the object's active GUID to the
GUID of its most recent version
Verify a data object’s legitimate writers
Maintain an update history providing an undo
mechanism
Three forms of data: object replicas, secondary
replicas and archival fragments
(Figure: the primary replica (inner ring) feeds updates to the
secondary replicas and to the archive)