Cloud Computing in Distributed Systems

Seminar Report

Submitted in partial fulfillment of the requirements


for the degree of
Bachelor of Technology

by

Gurmeet Singh
Roll No. 07010150

Under the supervision of

Dr. Diganta Goswami


Associate Professor

Department of Computer Science and Engineering


Indian Institute of Technology, Guwahati
India

April 2010
Acknowledgements

I sincerely thank my supervisor Dr. Diganta Goswami, whose guidance helped me carry out a literature review
on the topic “Cloud Computing in Distributed Systems”. His suggestions to critically analyse the literature
and to examine the issues addressed in existing work from an innovative viewpoint helped me develop a
research outlook.

I extend my thanks to Mr. Karthik R, a second-year M.Tech student at IIT Guwahati, for explaining his
work with Dr. Goswami on “An Open Cloud Architecture for Provision of IaaS” during the initial stages of
this research.

I also thank Dr. Saikat Guha, researcher at Microsoft Research Bangalore, for suggesting that I read the
work on the “Google File System” and “Amazon’s Highly Available Key-value Store” during my search for
systems implemented at a large scale.

Abstract

Cloud computing, an Internet-based form of computing in which resources, software and information are
provided to computers on demand like a public utility, is emerging as a platform for sharing resources such
as infrastructure, software and various applications. The majority of cloud computing infrastructure consists
of reliable services delivered through data centers and built on servers. Clouds often appear as single points
of access for all of a consumer's computing needs. Commercial cloud offerings are expected to meet quality
of service guarantees for customer satisfaction and typically offer service level agreements. The deployment
of cloud computing can easily be observed while working on the Internet, be it Google Docs or Google Apps,
YouTube video sharing or Picasa image sharing, Amazon's shopping cart or eBay's PayPal; the examples
are numerous. This report surveys some of the prominent applications of cloud computing and examines how
they meet the requirements of reliability, availability of data, scalability of software and hardware systems,
and overall customer satisfaction.
Contents

1 Introduction

2 Eucalyptus
  2.1 Design
  2.2 Associated Issues

3 Service Orientation
  3.1 Cloud Computing Open Architecture
  3.2 Performance Model Driven QoS Guarantees and Optimization

4 File System
  4.1 Design Of Google File System
  4.2 Issues Addressed by the Implementation

5 Data Processing
  5.1 Implementation of MapReduce
    5.1.1 Using Map and Reduce
    5.1.2 Execution
  5.2 Issues Addressed

6 Data Availability
  6.1 Design of Dynamo
    6.1.1 Observations
    6.1.2 Architecture
  6.2 Issues Addressed

7 Conclusion

Chapter 1

Introduction

Cloud computing encompasses the hosting of several services over the Internet, commonly divided into three
categories: Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS).

SaaS is a model of software deployment in which a provider licenses an application to users for use as a
service on demand. Vendors may host the application on web servers or download it to the consumer's
device, disabling it after the on-demand contract expires. Google Apps, Google Docs [7], Acrobat.com and
Salesforce.com are major SaaS providers.

PaaS provides the deployment of applications without the cost and complexity of buying and managing
the underlying hardware and software layers, supplying all of the facilities required to support the complete
life cycle of web applications. Amazon Web Services [2], the Azure Services Platform, the Rackspace Cloud
and Google App Engine are examples of this category.

IaaS is the delivery of computer infrastructure, usually a platform virtualization environment, in which
clients buy resources as a fully outsourced service instead of purchasing servers, software, data center space
or network equipment. An IaaS offering such as Amazon Web Services provides virtual server instances with
unique IP addresses and blocks of storage on demand. Amazon Elastic Compute Cloud [2] and Eucalyptus [1]
are prominent examples of IaaS.

This report looks into the design considerations and system architecture of some well-known applications
of cloud computing, highlighting the use of core distributed-systems techniques in these renderings. It further
addresses the issues faced by the developers before, during and after the implementation of each design,
presenting a comprehensive picture of diverse cloud renderings whose system requirements differ because of
the varying expectations of their customers.

The report begins with a chapter on Eucalyptus, an IaaS rendering developed for research purposes and
research environments. Next is a review of design suggestions focused on customer satisfaction, either through
a Service Oriented Architecture or through Quality of Service guarantees to the customers, while optimizing
the profit of the cloud vendors. The Google File System discussed afterwards focuses on requirements specific
to Google's workload, such as the dominance of append operations over random writes. The chapter on
MapReduce that follows aims to improve performance while keeping the design as simple as the Map and
Reduce functions of functional programming. Lastly, Amazon's Dynamo addresses high reliability and
availability while trading off consistency to achieve them.

Chapter 2

Eucalyptus

EUCALYPTUS [1] is an open-source cloud-computing framework for research purposes that uses computational
and storage infrastructure commonly available to research groups. It is composed of several hierarchical
components, viz. the Cloud Controller, Cluster Controllers and Node Controllers, which interact with each
other while supplying facilities to the cloud client.

Cloud computing systems delivering Infrastructure as a Service dynamically provision Virtual Machine (VM)
instances to the client for hosting software services. Scheduling of the VM instances is one of the crucial
questions in cloud computing. Eucalyptus attempts to solve the issues of VM scheduling, storage of data,
networking between the nodes of the cloud and the definition of user interfaces.

2.1 Design

Each of the four components of Eucalyptus has its own Web-service interface for communication with the
other components; they are described as follows:

• Node Controller: Every node that runs Virtual Machine instances executes a Node Controller (NC).
An NC is expected to reply to describeResource and describeInstance queries from its Cluster Controller
(CC) about the node's number of cores, memory size or disk space available, and to handle the subsequent
control requests: runInstance by creating a virtual network endpoint and instructing the hypervisor to
start the instance, and terminateInstance by instructing the hypervisor to end the VM, tearing down the
network endpoint and cleaning up the local data (a minimal sketch of this interface appears after this list).
• Cluster Controller: A CC is the head of many NCs forming a cluster. Its job is to connect the
Cloud Controller (CLC) to the NCs. It distributes the general requests of the CLC to all nodes in the
cluster and also trickles down specific requests of the CLC to a subset of nodes in the cluster.
• Cloud Controller: The CLC issues runInstances, describeInstances, terminateInstances and
describeResources commands to a CC or a set of CCs. It manages all this information and, being the only
entry point to the cloud, it schedules the VM instances. The CLC also provides a user-visible interface to
the cloud, through which users sign up and query the system, as well as a cloud administrator interface
for inspecting the availability of system components.
• Walrus: Walrus is a data storage service which streams data in and out of the cloud and also stores the
VM images that are uploaded to it and accessed from the nodes. It supports concurrent and serial data
transfer.
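To make the component interfaces above concrete, the following is a minimal Python sketch of the kind of
web-service operations a Node Controller exposes to its Cluster Controller. It is an illustration only; the
class and method names are assumptions of this sketch, not the Eucalyptus source.

```python
# Minimal sketch (not Eucalyptus source) of the operations a Node Controller
# answers for its Cluster Controller. All names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class NodeResources:
    cores: int
    memory_mb: int
    disk_gb: int


@dataclass
class NodeController:
    resources: NodeResources
    instances: dict = field(default_factory=dict)  # instance_id -> image_id

    def describe_resource(self) -> NodeResources:
        # Answer the CC's query about cores, memory and disk available.
        return self.resources

    def run_instance(self, instance_id: str, image_id: str) -> str:
        # Create the virtual network endpoint and ask the hypervisor
        # to boot the VM image (both steps elided in this sketch).
        self.instances[instance_id] = image_id
        return "running"

    def terminate_instance(self, instance_id: str) -> str:
        # Instruct the hypervisor to stop the VM, tear down the network
        # endpoint and clean up local state.
        self.instances.pop(instance_id, None)
        return "terminated"
```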

Apart from these high-level components, an essential part of Eucalyptus is the Virtual Overlay Network,
a VLAN implementation running on top of the Virtual Machines. Users attach a VM instance to a “Network”
at boot time. Each such network has a unique VLAN tag, which connects VMs to the public Internet while
at the same time separating VMs belonging to different cloud allocations.

2.2 Associated Issues

Designed for academic and research purposes, Eucalyptus deploys an infrastructure for VM creation
controlled by the user. The main concern during the design was the use of resources commonly found within
a research environment. Hence Eucalyptus uses hardware commonly found in existing laboratories, including
Linux clusters and server farms.

The networking used is a simple, flat virtual network which addresses three issues.

• Connectivity: The virtual overlay network provides connectivity of nodes to the public Internet and to
other nodes running VM instances scheduled by the same cloud allocation. Connectivity can also be partial,
so that at least one VM instance from a set of instances has connectivity to the Internet, through which the
user can log in and access all the instances.
• Isolation: The overlay network isolates the network of the nodes of one cloud allocation from that of the
nodes of another cloud allocation for security reasons. This prevents a VM instance of one cloud allocation
from acquiring the MAC address of a physical resource and interfering with VM instances of other cloud
allocations on the same resource.
• Performance: Owing to the reduced performance overheads of virtual networking in recent years, the
use of such a network design is favoured.

Research is further facilitated by the modular nature of the design, which lets researchers replace one
component for enhancement without interfering with the others.

Eucalyptus' simple design offers only the basic requirement of service provisioning [4]. It suffers from
heavy internal network traffic because the nodes frequently access the data centers. Moreover, cloud systems
are configured with peak traffic in mind, so most of the nodes, and hence the resources, are left idle most of
the time [5].

In AOCAPI [5], when the CC gives a removeInstance command to the NC, the NC neither removes the
disk image from the machine nor disturbs the file system. It merely marks the instance as disabled, which is
treated the same as removing the image. The instance can later be marked as enabled again and run when
the image has to be reloaded. This eliminates the overhead of fetching the disk image from the data center,
reducing network overheads.

For this purpose of smart scheduling, an address controller is used which decides the address to which
each user request must be forwarded. The address controller consults a usage register and a recent index.
The usage register monitors the usage of all nodes by recording information such as the CPU load exerted on
a node by each virtual machine instance running on it. The recent index records the recent set of nodes used
by each user: it stores the address of the virtual machine instance to which a request was last sent for a user,
along with a time stamp to identify the most recent one.
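The following is a hypothetical sketch of that forwarding decision, assuming a simple in-memory usage
register and recent index; the class names, load threshold and data layout are assumptions of this sketch,
not the design in [5].

```python
# Hypothetical sketch of the address-controller decision described in [5];
# names (UsageRegister, RecentIndex, etc.) are assumptions, not the paper's API.
import time


class UsageRegister:
    """Tracks the CPU load exerted on each node by its VM instances."""
    def __init__(self):
        self.load = {}  # node_address -> current CPU load (0.0-1.0)

    def least_loaded(self):
        return min(self.load, key=self.load.get)


class RecentIndex:
    """Remembers, per user, the instance address last used and when."""
    def __init__(self):
        self.last_used = {}  # user_id -> (node_address, timestamp)

    def record(self, user_id, address):
        self.last_used[user_id] = (address, time.time())


def forward_request(user_id, usage, recent, load_threshold=0.8):
    """Pick the address to which a user's request should be forwarded."""
    entry = recent.last_used.get(user_id)
    if entry is not None:
        address, _ = entry
        # Re-use the most recent instance if its node is not overloaded,
        # avoiding a fresh disk-image fetch from the data center.
        if usage.load.get(address, 1.0) < load_threshold:
            return address
    # Otherwise fall back to the least-loaded node in the cluster.
    address = usage.least_loaded()
    recent.record(user_id, address)
    return address
```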

Chapter 3

Service Orientation

The aim of a cloud computing platform is to deliver services to the cloud clients, yet most platforms
have not yet adopted a service oriented architecture (SOA) to provide Quality of Service (QoS) guarantees
to the clients. At the same time, there should be run-time optimization of the cloud so as to attain maximum
profit, constrained by those QoS guarantees and the Service Level Agreements (SLAs) between the cloud
vendor and its clients.

3.1 Cloud Computing Open Architecture

Figure 3.1: CCOA Overview. Figure taken from [6]

The Cloud Computing Open Architecture (CCOA) presented in [6] amalgamates service oriented
architecture with virtualization techniques through its seven architectural principles and ten interconnected
modules. The architecture meets the end objectives of creating a scalable provisioning platform for cloud
computing which can be configured based on customer requirements, proposing shared services that provide
cloud offerings to business consumers in a unified way, and maximizing business value through the
monetization of computing.

• The first of the seven principles, illustrated along with the ten modules in Fig. 3.1 (also from [6]), is
Integrated Cloud Ecosystem Management. It includes four interdependent modules which give and
take services from one another. The Cloud Vendor Dashboard is used for managing the internal
operations of the cloud. The Cloud Partners Dashboard serves cloud partners who collaborate with cloud
vendors to leverage services to the cloud client, while also providing components through its interface to
the rest of the cloud. The Cloud Client Dashboard is the centre of the unified framework clients use to
access services, be it a Web portal, a program-based channel for enterprise users or a phone-based
customer-representative channel for individual customers. Cloud vendors and cloud clients also interface
with the Cloud Ecosystem Management module, which supervises cloud activities while managing
memberships.
• The second principle, Virtualization of Infrastructure, is met by using hardware components in
plug-and-play mode for hardware virtualization and by managing software images, code and sharing for
software virtualization. The module used is the Core Infrastructure of the Cloud.
• The third principle is Service Orientation, which is provided by the Cloud Horizontal Business
submodule, covering the platform services shared by a range of customers, and the Cloud Vertical Business
submodule, covering services that are more domain or industry specific.
• The fourth principle of Extensible Provisioning and Subscription for the cloud segregates Cloud
Provisioning Services from Cloud Subscription Services, which share a role-definition framework and a
notification framework but run the provisioning process and the subscription process separately.

• The fifth principle, Configurable Cloud Offerings, covers cloud business solutions in the form of
IaaS such as a storage cloud like Google Docs, SaaS such as the software leveraged by PayPal for its
customers, Application as a Service such as web-based development tools, and Business Process as a
Service such as software testing platforms.

• The next module, the Cloud Information Architecture, is responsible for effective communication
among the various modules and helps meet the sixth principle of Unified Information Exchange.
• Lastly, the Cloud Quality Governance module identifies quality indicators and governs their state,
using Quality of Service parameters to define reliability, response time and security. With this module
the most important principle, Cloud Quality and Governance, is attained.

This architecture successfully amalgamates the power of service oriented architecture, missing in many
cloud offerings, with the use of virtualization technology, missing in pure service oriented architectures.

3.2 Performance Model Driven QoS Guarantees and Optimization

While the previous work focuses on how to provide services to customers in a unified way while making
the best use of resources, the Performance Model Driven Cloud in [4] monitors the performance delivered
by the cloud, ensures QoS guarantees to customers, and optimizes the profit of the cloud subject to those
QoS guarantees and SLAs.

A performance model predicts the interactions among many decision variables and thus enables optimal
decisions in autonomic control. A Layered Queueing Model (LQM) corresponds well to layered resource
behaviour: it is an extended queueing network model which predicts throughputs, queueing delays, service
delays and the utilization of resources. The performance parameters of an LQM are the external services,
the CPU demands of entries and the requests made within entries.

Quality of Service is a goal of cloud management that is treated as a constraint on the resource
optimization, that is, on seeking maximum profit out of a minimum number of resources. For a service of
class $c$, the price the customer pays to the application is $P_c$, and the response time of the service is
taken as the measure of its QoS.

The workload given to a class of service $c$ describes the intensity of the stream of user requests for the
service, either as a throughput $f_c$ for user class $c$, or as the number $N_c$ of interacting users together
with their think time $Z_c$, the user's mean delay between receiving a response and issuing the next request.
For each service class $c$ there is a required throughput $f_{c,min}$ or a required user response time
$R_{c,max}$. $R_{c,max}$ can be expressed as a minimum user throughput requirement using Little's result:

$$f_c \;\geq\; f_{c,min} = \frac{N_c}{R_{c,max} + Z_c}$$

The original delay requirements are thus converted into throughput requirements, and the optimization
need only consider throughput.

A Network Flow Model (NFM), depicting the flow of execution demands at the processors, is used
for the optimization. The nodes of the NFM are the entities, and the arcs, with their weights, represent
the flow of demand in CPU-seconds of execution per second.

Figure 3.2: Network Flow Model. Figure taken from [4].

Each host $h$ has a price of CPU execution $C_h$ per CPU-sec, including unused CPU-seconds allocated in
order to reduce contention delays. In the NFM results, each task $t$ has a reservation $\alpha_{ht}$, in
CPU-sec per second, on some host $h$. If $\zeta_{App}$ and $\tau_{App}$ are the sets of user classes and tasks
involved with an application $App$, the profit from that application is

$$PROFIT_{App} = \sum_{c \in \zeta_{App}} P_c\, f_c \;-\; \sum_{(h,t) \in \tau_{App}} C_h\, \alpha_{ht}$$

The cloud optimization is to maximize the total profits:

$$TOTAL = \sum_{App} PROFIT_{App}$$
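As an illustration of these two formulas, the short sketch below converts a response-time requirement into
a throughput floor via Little's result and evaluates the profit of one candidate allocation. The numbers and
names are invented for this example and are not taken from [4].

```python
# Illustrative sketch (not from [4]) of the formulas above: converting a
# response-time requirement into a throughput floor via Little's law, and
# evaluating the profit of a candidate allocation. All names are assumptions.

def min_throughput(n_users, r_max, think_time):
    """Little's law: f_c,min = N_c / (R_c,max + Z_c)."""
    return n_users / (r_max + think_time)


def app_profit(class_prices, class_throughputs, cpu_prices, reservations):
    """PROFIT_App = sum_c P_c * f_c  -  sum_(h,t) C_h * alpha_ht."""
    revenue = sum(class_prices[c] * class_throughputs[c] for c in class_prices)
    cost = sum(cpu_prices[h] * alpha for (h, _t), alpha in reservations.items())
    return revenue - cost


# Example: 100 users, 2 s response-time target, 8 s think time.
f_min = min_throughput(n_users=100, r_max=2.0, think_time=8.0)   # 10 requests/s

profit = app_profit(
    class_prices={"browse": 0.01},             # price per request
    class_throughputs={"browse": 12.0},        # must satisfy >= f_min
    cpu_prices={"host1": 0.05},                # price per CPU-sec on host1
    reservations={("host1", "webTask"): 1.5},  # CPU-sec per sec reserved
)
```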

This approach of optimization for profit maximization is effective as well as scalable enough to meet the
new challenges of cloud computing. Scalability to very large clouds comes from partitioning the
performance-model calculations over subsets of processors; as observed in the implementations of [4], these
subsets can still be very large, so as to accommodate many applications. Further work is being done to
account for VM overhead costs, memory allocation, communication delays and the licensing costs of
software replicas.

Chapter 4

File System

Google is one of the most prominent online examples of the cloud computing paradigm. It has a host of
offerings ranging from the storage cloud of Google Documents to the application cloud of Google Apps. For
this purpose the underlying filesystem must be adapted to the needs of the customers, keeping in mind the
nature of the operations that most clients perform.

Google File System [7] is a scalable distributed file system for large distributed data-intensive applica-
tions. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high
aggregate performance to a large number of clients. The largest cluster to date provides hundreds of
terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed
by hundreds of clients.

Several observations led to the development of the filesystem:

• Since component failures are the norm for such a large system, the system must include constant
monitoring, error detection, fault tolerance and automatic recovery as integral components.
• The block sizes, the I/O operation parameters and other design assumptions must take into account that
file sizes are huge; multi-GB files are quite common.
• Appending data at the end of files is more common than overwriting existing data. Moreover, once written,
the files are mostly only read, and often only sequentially; random writes are practically non-existent.
Hence performance optimization and atomicity guarantees for the append operation became the focus of
the design, while caching data blocks at the client lost its appeal.
• Lastly, co-designing the applications and the filesystem's application programming interface hugely
benefits the overall system by increasing flexibility; for instance, GFS's consistency model was relaxed to
simplify the file system without burdening the applications.

4.1 Design Of Google File System

A Google File System cluster consists of a single master and multiple chunkservers, and is accessed by
multiple clients. GFS client code linked into each application implements the file system API and
communicates with the master and the chunkservers to exchange data on behalf of the application.

• Chunkserver and the Chunks: Files are divided into chunks of fixed size, discussed in detail in
Section 4.2, and each chunk is assigned a unique chunk handle by the master at creation time. The
chunks are replicated on multiple chunkservers for reliability; by default three replicas are stored, though
users can specify different replication levels for different regions of the file namespace.

A chunkserver knows which chunks it stores. The master does not persistently store a record of which
chunkservers hold a replica of a given chunk; but since all client queries for chunk locations come to the
master, it maintains this information by polling the chunkservers at start-up and later exchanging periodic
HeartBeat messages with them.

• Master
The GFS master, like the chunkservers and the clients, is a Linux machine; it maintains all the file system
metadata, including the namespace, access control information, file-to-chunk mappings and chunk locations.
Yet the master does not become a bottleneck for the system, since clients never read or write file data
through the master.

– If a client wants to read data from a file, it computes the chunk index within the file from the chunk
size and the byte offset specified by the application. It then asks the master which chunkserver it
should contact for that file name and index, and temporarily caches the returned chunk handle and
replica locations, using the file name and index as the key (a minimal sketch of this read path appears
after this list). Further interaction involves only the chunkserver; the master is not disturbed while
the file is read or written, and is next contacted when the cached information expires or the file is
reopened.
– The master stores three kinds of metadata in memory: the file and chunk namespaces, the mappings
from files to chunks, and the locations of each chunk's replicas. Since the metadata is stored in memory,
master operations are fast. The first two kinds are kept persistent, for reliability, by logging mutations
to an operation log on the master's local disk, which is also replicated on remote machines; the chunk
replica locations are not stored persistently, as discussed above.
– As introduced above, the master maintains an operation log that records metadata changes and serves
as a logical time line of operations, used for reliability. The logical time line helps to identify files and
chunks uniquely and defines the order of concurrent operations. The log must stay small to keep
startup fast, so the master checkpoints its state whenever the log grows beyond a certain size; the
checkpoint is created in a separate thread while logging switches to a new file. Recovery needs only the
latest checkpoint and the subsequent log files.
– The master is also involved in many other processes, such as atomic record appends, write operations,
snapshot operations, namespace management and locking, replica placement, garbage collection and
stale replica detection. Implementation details of some of these are given in Section 4.2; please refer
to [7] for further details.
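A minimal sketch of the client read path described above is given below; the method names (lookup,
read_chunk), the cache layout and the class structure are assumptions of this sketch, not the real GFS
client library.

```python
# Hypothetical sketch of the GFS client read path described above; the real
# client library's API differs. CHUNK_SIZE matches GFS's 64 MB chunks.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


class GFSClientSketch:
    def __init__(self, master, chunkservers):
        self.master = master              # answers (filename, chunk_index) lookups
        self.chunkservers = chunkservers  # chunkserver_id -> server object
        self.cache = {}                   # (filename, chunk_index) -> (handle, replicas)

    def read(self, filename, offset, length):
        # Translate the application's byte offset into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.cache:
            # Only metadata flows through the master; it returns the chunk
            # handle and the locations of its replicas, which we cache.
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        # File data is read directly from a chunkserver, never via the master.
        server = self.chunkservers[replicas[0]]
        return server.read_chunk(handle, offset % CHUNK_SIZE, length)
```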

4.2 Issues Addressed by the Implementation

A large number of issues are addressed by the file system, the details of which are hidden inside the
implementation. They have been studied and are listed below:

• The file system obviates the need to cache file data at both the clients and the chunkservers. Clients
need not cache because the files and the working sets are huge and most references to data are sequential,
so caching would offer little benefit while inviting cache coherence issues and complicating the client.
Chunkservers need not cache file data either, because chunks are stored as local files on the chunkservers,
which already benefit from Linux's built-in buffer cache.
• Deciding the chunk size is one of the key design parameters. A large chunk size reduces the clients' need
to send the master many file-name-and-index requests to obtain chunk handles and locations; it reduces
network overhead, since more of a client's operations go to the same chunkserver and a persistent TCP
connection to that chunkserver can be kept open longer; and it obviously reduces the metadata stored at
the master.
On the other hand, for a small file with a small number of chunks, perhaps one, the few chunkservers
holding those chunks can become hot spots. In practice, applications using GFS read large files spread
over many chunks sequentially, so hot spots are not a major issue. The chunk size is therefore chosen to
be 64 MB, which is quite large compared to typical filesystem block sizes.

• If the master persistently stored the chunk locations, the problem of keeping the master and the
chunkservers in sync would crop up. Hence GFS polls the chunkservers at startup and uses HeartBeat
messages to monitor their status. This keeps things simple as chunkservers in a large cluster join and
leave the cluster, change names, restart, fail and so on.

• GFS decouples the data flow and the control flow entirely. Control flows from the client to the
Primary Replica, which holds a lease on the chunk granted by the master, and from the Primary to the
Secondary Replicas. Data, however, can flow in any order: the client sends it to the nearest chunkserver,
primary or not, and each chunkserver forwards the data to the nearest chunkserver that has not yet
received it. This chained (serial) transfer fully utilizes each machine's network bandwidth compared to
distributing the data in some other topology such as a tree, forwarding to the nearest chunkserver avoids
network bottlenecks and high-latency links, and latency is minimized because each chunkserver starts
forwarding data as soon as it begins receiving it (pipelining).

Figure 4.1: Write Control Flow and Data Flow. Figure taken from [7].

• From the observation of the kinds of data requests made by clients, append operations are the most
common, so GFS provides an atomic append operation called Record Append which is heavily used
by multiple clients on different machines to append to the same file concurrently. GFS appends the given
data to the file at least once atomically (i.e. as one continuous sequence of bytes) at an offset of GFS's
own choosing and returns that offset.
Since the chunk size is fixed, the amount of data that can be appended in one operation must also be
restricted, else chunks would soon overflow and data would be lost. When a client sends an append
request (control flow) to a Primary Replica while pushing the data to all replicas serially, the Primary
checks whether adding the data would cause an overflow. If so, it pads the chunk to its maximum size,
tells the Secondaries to do the same, and replies to the client that it should retry on another chunk.
Restricting record-append data to at most one fourth of the chunk size keeps the worst-case
fragmentation acceptable.
If an append fails at any of the replicas, the client retries it, which can leave different replicas of the same
chunk with different data, possibly containing partial or complete duplicates of the same record. Such
replicas are allowed in GFS, since record append only requires that the record be written on all replicas
at the same offset; what exists at other offsets is immaterial, because those offsets are not returned to the
application and will not normally be read.

• The Snapshot operation makes an instantaneous copy of a file or a directory tree with minimal
interruption to ongoing mutations. Users employ it to quickly create branch copies of huge data sets or
to checkpoint the current state before making changes that might later require an easy roll back.
The Snapshot operation is mostly handled by the master, which logs the operation immediately on
arrival of a request. It then revokes the leases it granted to the chunkservers holding chunks of the files,
so that when a client later wants to write to a chunk it cannot find a lease holder and must contact the
master, at which point the master creates a copy of the chunk. This shifts the large overhead of the
snapshot operation from the moment the snapshot is taken to a small overhead incurred when the
snapshotted file's chunks are subsequently written.
After the master revokes the leases on the file's chunks, it creates a copy of the file's metadata which
points to the same chunks as the source file. These chunks therefore have a reference count greater than
one (at least two), which the master notices when a write request for such a chunk arrives; the master
then first creates a copy of the chunk on the same chunkserver (reducing network overhead) before
replying to the client.
• Replica Placement: The chunks of GFS are replicated on chunkservers distributed across multiple
machine racks to ensure reliability in situations such as the failure of an entire rack or a network switch
disconnecting a rack from the system temporarily. This also helps utilize the network bandwidth of
multiple racks, especially for chunk read requests, when reads of a chunk are distributed over chunkservers
(and hence across racks).
• Garbage Collection: Garbage collection is done lazily by the GFS master; several notable points about
its mechanism, listed below, enhance the performance of the system:

– Safety against accidental delete: When a file is deleted by an application, the master logs the operation,
marks the file as deleted and renames it to a hidden name. The file remains visible to the application
under the hidden name and can be restored to its original name, hence providing safety against
accidental deletion.
– Scan of the file system namespace: Master does a regular Scan of the file system namespace during
which it removes the hidden files if they have existed for more than three days, a configurable
interval.
– Scan of the chunk namespace: The master's similar regular scan of the chunk namespace identifies
chunks not reachable from any file and erases the metadata of those chunks. During the HeartBeat
exchanges, each chunkserver reports a subset of the chunks it holds, and the master replies with the
identity of the chunks no longer present in its metadata, which the chunkserver is then free to delete.
– Simple and Reliable: The above method is simple and reliable for a large-scale distributed system
like GFS, where component failure is the norm. In situations such as a chunk creation that succeeds
on only some chunkservers, the master may not even be aware that certain chunks exist; these are
removed during garbage collection.
– The Major Advantage: The primary advantage of the Garbage Collection Mechanism is that it is
mostly done along with background activities of the master like HeartBeat messages and chunk
and file system namespace scan. Master does garbage collection when it is relatively free so as to
give timely service to the client requests.
– The Main Disadvantage: The only disadvantage appears when applications repeatedly create and
delete temporary files, tightening storage and preventing its immediate re-use. This is addressed by
expediting storage reclamation when an already-deleted file is deleted again, much like emptying the
‘trash' or ‘recycle bin' on a personal system. Users can also apply different replication and reclamation
policies to different parts of the namespace, and configure the interval after which a hidden file is
removed from the namespace and the in-memory metadata.

• Availability Issues: Component failures in the system can lead to unavailability of data or, in the worst
case, corrupted data. Here we discuss how GFS ensures availability; the next point, on Data Integrity,
explains how corruption of data is handled. Availability is ensured at three levels:

– Fast Recovery: All the components of the system, the chunkservers and the master are designed
to restart from a shut down as well as from a failure within seconds, and restore their states in
no time. In fact, there is no difference between a normal termination and a component failure.
– Chunk Replication: As discussed earlier, a chunk is replicated on several, usually three, chunkservers
across different machine racks to ensure reliability and hence availability.
– Master Replication: The master state, operation logs, and checkpoints are all replicated on mul-
tiple machines for reliability. A change to master’s state is considered committed only after it has
been successfully written to local disk as well as all replicas.
When the master fails it can restart almost instantly. When it cannot, perhaps due to a disk or
machine failure, a monitoring infrastructure outside GFS starts a new master on a different
machine using the replicated log; clients use a canonical name, say gfs-test, which is simply a DNS
alias that is updated to point to the new master.
GFS also provides Shadow Masters, which lag the primary master by fractions of a second, to
enhance scalability and availability for read operations. Because of this small lag, applications may
read slightly stale file metadata, such as directory contents, but never stale file data, since data is
read from the chunkservers. Shadow masters also provide read-only access to the filesystem when the
primary master is down.

• Data Integrity: It is not feasible to detect corruption by comparing a replica's data with that of other
chunkservers, both because of the network and performance overheads and because GFS does not
guarantee that legal replicas of the same chunk contain identical data; the record append operation, for
example, can leave replicas with different data (as discussed above) at offsets the application never uses,
and the mere existence of such data must not be interpreted as corruption.
GFS therefore verifies the correctness of data on a chunk (ensuring reliability against component failures
that corrupt data) by keeping a 32-bit checksum for each 64 KB block of a 64 MB chunk, stored with the
other chunk metadata and persisted with the logs (a small sketch of per-block verification appears at the
end of this section). Some notable points about the checksum mechanism are as follows:

– Read Operation: When a client or another chunkserver requests data from a chunkserver, the
chunkserver verifies the checksums of the blocks covering the requested range before returning the data.
If some data is corrupted, the chunkserver returns an error to the requester and reports the mismatch
to the master. The requester then asks the master for the location of another replica of the chunk and
repeats the request, while the master clones a valid replica from another chunkserver; once the clone is
in place, the master instructs the chunkserver with the corrupted data to delete its replica. GFS client
code reduces the verification overhead by trying to align reads at block boundaries.
– Append Operation: Being the dominant operation for GFS, checksum calculation for appends is
heavily optimized: the checksum of the last, partially filled block of a chunk is updated incrementally,
and new checksums are computed for the newly written blocks. If that partially check-summed block
already contained corrupted data, the corruption is not detected now but will be caught during a later
Read Operation, keeping this overhead out of the append path.
– Write Operation: If a write's data spans blocks such that the operation partially overwrites the first
and/or last block of the range, the data in those boundary blocks that is not being overwritten must
first be read and verified; the write then proceeds and new checksums are computed. A mismatch is
handled in the same way as for the read operation.

Chunkservers can, during idle periods, scan and verify the contents of inactive chunks that have not been
touched by any read or write operation for a long time; the master then creates a new valid replica for
every corrupted chunk detected by the scanning chunkserver. This prevents the master from falsely
believing that it has enough legal copies of inactive chunks while most of the copies have in fact been
corrupted.
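The sketch below illustrates the per-block checksum idea described above, assuming one 32-bit checksum
per 64 KB block; CRC-32 is used only for illustration, since [7] does not specify the checksum function.

```python
# Small sketch of per-block checksumming as described above: a chunk is divided
# into 64 KB blocks, each covered by a 32-bit checksum that is verified before
# data is returned to a requester. zlib.crc32 is an illustrative choice only.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB


def block_checksums(chunk_data: bytes) -> list[int]:
    """Compute one 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]


def verify_range(chunk_data: bytes, checksums: list[int],
                 offset: int, length: int) -> bool:
    """Verify every block overlapping [offset, offset + length) before a read."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False  # report the mismatch to the master, return an error
    return True
```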

With its centralized-master approach, GFS thus meets the requirements of simplicity, flexibility, reliability,
fault tolerance and high performance. It is widely used within Google as a storage platform, as well as for
research purposes, for instance to deploy MapReduce, explained in the next chapter.

Chapter 5

Data Processing

For data processing and computations on large data sets, cloud computing needs a simplified programming
model that is parallelizable and fault tolerant and that handles data distribution and load balancing over a
large distributed system. On top of that, it should be easy to use. To address these needs, Google came up
with MapReduce [3], which delivers all these requirements along with ease of use, as the messy details of
parallelization and the rest are hidden inside a library. The implementation and the issues addressed,
discussed in the following sections, explain the working of this simple approach to data processing.

5.1 Implementation of MapReduce

MapReduce is based on the Map and Reduce primitives of functional programming languages like Lisp.
This fits because most computations at Google involve applying a Map function to input records to produce
a set of intermediate (key, value) pairs, and then applying a Reduce function that combines all values
sharing the same intermediate key to obtain the final output.

5.1.1 Using Map and Reduce

A user of MapReduce uses the MapReduce library to express the computation to be done in terms of Map
and Reduce functions, which are the only parts that have to be written by the user.

• Map: This function is written by the user to produce, from a given input (key, value) pair, a set of
intermediate (key, value) pairs, in such a way that all values with the same intermediate key, say I, are
grouped together by the underlying MapReduce library and given as input to the Reduce function.
This can be summarized as: (k1, v1) → list(k2, v2)
The Map function, as in ordinary functional programming, can change the domain of the input keys and
values, so that (k1, v1) is transformed into pairs of type (k2, v2).
• Reduce: This user-written function accepts an intermediate key I from the MapReduce library and a
set of intermediate values for that key. It merges the data of these intermediate values into a smaller set
of values; typically zero or one output value is produced. To handle very large sets of intermediate
values, an iterator is used to deliver the intermediate values to the Reduce function.
This can be summarized as: (k2, list(v2)) → list(v2)
The Reduce function, as expected, does not change the domains of its input keys and values.
• Exemplification: In a simple case such as finding the lines of a file that contain any of a set of patterns,
a distributed system would prefer a distributed grep over a conventional one. The Map function is
written to emit a line if it contains the pattern (the key), whereas Reduce is an identity function that
copies the input to the output file.
Similarly, to count the number of times each word occurs in a large collection of files, Map takes a
(file-name, file-contents) pair as input and emits (word, count) pairs as intermediate output. The Reduce
function, receiving the list of counts for each word through its iterator, sums them and outputs the final
count for each word (a small sketch of these two functions appears after this list).
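Below is a minimal Python sketch of the word-count Map and Reduce functions just described; the emit
callback and iterator-based interface are assumptions of this sketch, not the API of the real MapReduce
library.

```python
# Minimal word-count Map and Reduce functions in the spirit of [3]; the emit
# interface and iterator type are assumptions of this sketch.

def word_count_map(file_name, file_contents, emit):
    """Map: (file-name, file-contents) -> intermediate (word, 1) pairs."""
    for word in file_contents.split():
        emit(word, 1)


def word_count_reduce(word, counts, emit):
    """Reduce: (word, iterator over counts) -> total count for the word."""
    emit(word, sum(counts))
```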

5.1.2 Execution

Figure 5.1: Execution of Map and Reduce. Figure taken from [3].

The figure above shows the flow of a MapReduce operation. The input data is partitioned into a set of
M splits, processed in parallel by different machines. Reduce invocations are distributed by partitioning the
intermediate key space into R partitions using a partitioning function, say hash(key) mod R; both the
partitioning function and R are defined by the user.

1. The MapReduce library splits the input files into M pieces of typically 16 MB to 64 MB per piece and
forks many copies of the program on a cluster of machines.
2. One copy of the program is the master, which assigns the M map tasks and R reduce tasks to the rest
of the workers. The master picks idle workers and assigns each one a map task or a reduce task. The
master is also responsible for storing, for each task, its state (idle, in-progress or completed) and the
identity of the worker machine handling each non-idle task.
3. A worker assigned a map task reads the contents of the corresponding input split, parses (key, value)
pairs out of the input data, passes each pair to the Map function, and buffers the intermediate
(key, value) pairs in memory.
4. The buffered pairs are periodically written to local disk, partitioned into R regions by the partitioning
function. The locations of these buffered pairs on the local disk are passed back to the master, who
forwards them to the reduce workers.
5. A reduce worker, on being notified by the master of these locations, uses remote procedure calls to
read the data from the local disks of the map workers and sorts the data by the intermediate keys,
using an external sort if the data is too large to fit in memory.
6. The reduce worker iterates over the sorted intermediate data, and for each unique intermediate key it
passes the key and the corresponding set of intermediate values to the user's Reduce function, whose
output is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program,
and the MapReduce call in the user program returns to the user code.
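To make this flow concrete, the toy single-process sketch below splits the input, runs Map, partitions the
intermediate pairs into R regions with hash(key) mod R, then sorts each region and runs Reduce. It reuses
the word-count functions from the previous sketch and illustrates only the data flow, with no parallelism,
fault tolerance or distributed storage.

```python
# Toy single-process sketch of the execution flow above; reuses word_count_map
# and word_count_reduce from the earlier sketch.
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn, R=3):
    # Map phase: one "task" per input split, buffering intermediate pairs
    # into R partitions chosen by a hash partitioning function.
    partitions = [defaultdict(list) for _ in range(R)]
    for file_name, contents in inputs.items():
        map_fn(file_name, contents,
               lambda k, v: partitions[hash(k) % R][k].append(v))

    # Reduce phase: each region is processed independently; keys are sorted
    # so each unique key is handed to the user's Reduce function once.
    output = {}
    for region in partitions:
        for key in sorted(region):
            reduce_fn(key, iter(region[key]),
                      lambda k, v: output.__setitem__(k, v))
    return output


result = run_mapreduce({"doc1": "the cat sat", "doc2": "the dog"},
                       word_count_map, word_count_reduce)
# result == {'cat': 1, 'dog': 1, 'sat': 1, 'the': 2}
```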

5.2 Issues Addressed


• Fault Tolerance: To hide fault handling from users, the MapReduce library must handle failures itself.
The master detects a failed worker through the periodic ping messages it sends; if a worker does not
reply, the master marks it as failed.

– If a Map Worker fails, whether its task is in progress or already completed, the task has to be
re-executed on some other worker, with its state reset to idle, because the failed worker wrote its
output to its local disk, which is no longer accessible to the Reduce Workers. The master ignores a
completion message for a task that has already been completed; if the completion message is for a
map task not completed earlier, the master records the names of the R files where the Map Worker
wrote its output, and Reduce Workers that have not yet read the data subsequently access this new
worker for the intermediate output.
– If a Reduce Worker fails while its task is in progress, the task is reset to the idle state and rescheduled
on some other worker. Already completed reduce tasks need not be re-executed, since Reduce Workers
write their output to the global filesystem. If the same reduce task ends up being executed on
multiple machines, multiple rename operations will be issued for the same final output file on
completion; this is handled by the atomic rename operation of the underlying filesystem, in this case
GFS (Chapter 4).
– Failure of the master could be handled by making the master write periodic checkpoints and, on
recovery, starting from the previous checkpoint. But since there is only a single master, the
probability of its failure is small, so the current implementation of MapReduce simply aborts the
computation and lets the client restart it if desired.

• Network Bandwidth: Google observed that network bandwidth is a scarce resource, and MapReduce
therefore conserves it by preferring to schedule each Map task on a machine that contains a replica of
its input data (see GFS); failing that, on a machine on the same network switch as one containing a
replica. Writing a single copy of the intermediate data to the Map Worker's local disk saves further
network bandwidth.

• Load Balancing: For best performance, the values of M and R should be much larger than the number
of workers available: this improves dynamic load balancing by spreading tasks across many workers,
which are assigned new tasks as they finish, and it speeds up recovery when a worker fails, since its map
tasks can be re-executed spread across many other machines.
M is chosen so that each task handles roughly 16 MB to 64 MB of input data, matching the chunk size
of the underlying file system, while R is usually constrained by the user because each reduce task
produces a separate output file; in practice R is a small multiple of the number of worker machines.
Typically, with 2,000 worker machines, a MapReduce computation uses M = 200,000 and R = 5,000.
• Partitioning Function: MapReduce leaves it to the user to choose the partitioning function that
divides the intermediate data into R partitions, since a user may want the output arranged in a
particular fashion. For instance, if the output keys are URLs, the user can specify a partitioning
function such as hash(Hostname(urlkey)) mod R so that all entries for the same host fall into the same
output file.
• Combiner Function: The intermediate keys produced by map tasks may be highly repetitive; in the
word-counting example, where word frequencies follow a Zipf distribution, each map task produces many
records of the form (the, 1), all of which would have to be sent over the network to a single reduce task.
To reduce this overhead, MapReduce allows the user to specify an optional Combiner function, executed
on each machine that performs a map task, which partially merges the intermediate data before it is
sent over the network. This function significantly speeds up certain classes of MapReduce operations.

The implementation of MapReduce scales to clusters comprising thousands of machines, making efficient
use of the machine resources and hence being suitable for many of the large computational problems
encountered at Google. It has been widely deployed for large-scale machine learning problems, clustering
problems for the Google News and Froogle products, extraction of data used to produce reports of popular
queries, extraction of properties of web pages for new experiments and products, and large-scale graph
computations.

The use of functional programming with user-specified Map and Reduce functions leads to easily
parallelized large computations. Fault tolerance is handled by re-execution, data distribution by the master,
and load balancing by techniques as simple as carefully choosing the numbers of Map and Reduce tasks.
All of this is achieved while hiding the complexity from the user, making MapReduce well suited to
simplified large-scale data processing.

Chapter 6

Data Availability

Amazon Web Services is a collection of remote computing services (also called web services) that together
make up a cloud computing platform, offered over the Internet by Amazon.com. In August 2006, Amazon
introduced the Amazon Elastic Compute Cloud (Amazon EC2), a virtual site farm that allows users to use
Amazon's highly reliable infrastructure to run diverse applications, ranging from simulations to web hosting.

The biggest challenge faced by Amazon.com, one of the largest e-commerce platforms in the world, is
reliability at massive scale, as even the slightest outage has significant financial consequences and shakes
customer trust. Dynamo, used by some of Amazon's core services, is a highly available key-value storage
system used to provide an always-on experience. To achieve this level of availability, Dynamo sacrifices
consistency under certain failure scenarios.

6.1 Design of Dynamo

Since strong consistency and high data availability cannot be achieved simultaneously in the presence of
network failures, Dynamo is designed to be an “eventually consistent” data store: all updates reach all
replicas eventually, changes are allowed to propagate to replicas in the background, and concurrent,
disconnected work is tolerated, in contrast to the Google File System discussed earlier.

6.1.1 Observations

Some interesting observations about Amazon's environment, beyond the need for high availability, that
influenced Dynamo's design are listed below:

• A large part of Amazon's services can work with a simple query model offering only read and write
operations, and do not need any relational schema. Dynamo targets applications that store objects that
are relatively small (usually less than 1 MB), as opposed to the terabyte-scale files handled by GFS.
• Since data stores that provide strong (ACID) guarantees tend to have poor availability, Dynamo targets
applications that can operate with weaker consistency, the “C” in ACID, if this results in higher
availability. Dynamo provides no isolation guarantees and permits only single-key updates.
• In Amazon's platform, services have stringent latency requirements, so services must be able to
configure Dynamo such that they consistently achieve their latency and throughput requirements. The
resulting trade-offs are in performance, cost efficiency, availability, and durability guarantees.
• A crucial requirement for many Amazon applications is an “always write-able” data store in which
no update is rejected due to failures or concurrent writes. Dynamo targets such applications.

6.1.2 Architecture

Dynamo stores objects through two operations: get() and put(). The get(key) operation locates the object
replicas associated with the key in the storage system and returns a single object or a list of objects with
conflicting versions along with a context. The put(key, context, object) operation finds where the replicas
of the object should be placed based on the associated key, and writes the replicas to disk. The context
includes information such as the version of the object.

The core distributed-systems techniques used in Dynamo are partitioning, replication, versioning,
membership, failure handling and scaling, each of which is explained below.

• Partitioning: To let Dynamo scale incrementally, it uses a mechanism that dynamically partitions the
data over the set of storage hosts, called nodes. Dynamo uses consistent hashing to distribute the
load across multiple storage hosts. Consistent hashing treats the output range of a hash function as a
fixed circular space, or ring. Each node is assigned a random value within this space which represents
its position on the ring. Each data item identified by a key is assigned to a node by hashing the data
item's key to yield its position on the ring, and then walking the ring clockwise to find the first node
with a position larger than the item's position.
Each node is thus responsible for the region of the ring between itself and its predecessor node, so the
joining or leaving of a node only affects its immediate neighbours while other nodes remain unaffected.
Since assigning each node a single random position on the ring leads to non-uniform data and load
distribution, Dynamo uses a variant of consistent hashing: instead of mapping a node to a single point
in the circle, each node is assigned multiple points in the ring. Each such virtual node looks like a single
node in the system, but each physical node is responsible for several of them: when a new node is added
to the system, it is assigned multiple positions, called tokens, in the ring (a sketch of such a ring with
virtual nodes appears after this list).
Using virtual nodes has the following advantages:
Using virtual nodes has the following advantages:

– When a node becomes unavailable, the load it handled is evenly dispersed across the remaining
available nodes.
– When a node becomes available again, or a new node is introduced to the system, the newly available
node accepts a roughly equivalent amount of load from each of the other available nodes.
– Heterogeneity in the physical infrastructure is taken into account by deciding the number of virtual
nodes a node is responsible for based on its capacity.

Figure 6.1: Nodes and Keys in the ring. Figure taken from [2].

• Replication: To achieve high availability and durability, Dynamo replicates its data on multiple hosts.
Each data item is replicated at N hosts. Each key k is assigned to a coordinator node, which is in charge
of the replication of the data items that fall within its range.
Each node is responsible for the region of the ring between it and its Nth predecessor. In Figure 6.1,
node B replicates the key k at nodes C and D in addition to storing it locally, and node D stores the
keys that fall in the ranges (A, B], (B, C] and (C, D]. The list of nodes responsible for storing a
particular key is called the preference list. The preference list for a key is constructed by skipping
positions in the ring to ensure that it contains only distinct physical nodes, since a node may hold more
than one virtual position in the ring.
• Version Handling: Dynamo uses vector clocks to detect conflicts between different versions of the
same object. A vector clock, associated with every version of every object, is a list of (node, counter)
pairs. If the counters on the first object's clock are less than or equal to those of the corresponding
nodes in the second clock, then the first version is an ancestor of the second and can be forgotten;
otherwise, the two changes are considered to be in conflict and require reconciliation.

Figure 6.2: Version Handling. Figure taken from [2].

In Dynamo, when a client wishes to update an object, it must specify which version it is updating (found
in the context obtained from an earlier read operation). On a read request, if Dynamo holds multiple
branches that cannot be syntactically reconciled (as shown in Figure 6.2 above), it returns all the objects
at the leaves, with the version information in the context. An update using this context is considered to
have reconciled the divergent versions, and the branches are collapsed into a single new version, as D3
and D4 are reconciled to D5 by the node Sx, which merges the vector clocks of the two versions.
• Ring Membership: The administrator uses a command line tool or a browser to connect to a Dynamo
node and issue a membership change to join a node to a ring or remove a node from a ring. A gossip-
based protocol, in which each node contacts a peer chosen at random every second and the two nodes
efficiently reconcile their persisted membership change histories, propagates membership changes. This
helps to maintain an eventually consistent view of membership. When a node starts for the first time,
it chooses its set of tokens and maps nodes to their respective token sets.
• Failure Detection: Since nodes are informed of permanent node memberships by the explicit node
join and leave methods, temporary node failures can be detected by the individual nodes when they
fail to communicate with others while forwarding requests.
Node A may consider node B failed if B does not respond to A's messages, even if B is responsive to
node C's messages. Node A then uses alternate nodes to service requests that map to B's partitions, and
periodically retries B to check for its recovery.
• Scalability: When a new node X is added to the system between nodes A and B, it becomes responsible
for storing keys in the ranges (F, G], (G, A] and (A, X]. As a consequence, nodes B, C and D no longer
need to store the keys in these respective ranges and offer to transfer the appropriate sets of keys to X.
When a node is removed from the system, the reallocation of keys happens in the reverse process.
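The following is an illustrative sketch of a consistent-hash ring with virtual nodes and preference lists, in
the spirit of the partitioning and replication scheme above; the hash function, token count and class names
are assumptions of this sketch, not Dynamo's implementation.

```python
# Illustrative consistent-hash ring with virtual nodes and preference lists.
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, tokens_per_node: int = 8):
        self.tokens_per_node = tokens_per_node
        self.ring = []       # sorted list of token positions
        self.owner = {}      # token position -> physical node name

    def add_node(self, node: str):
        # Each physical node claims several positions ("tokens") on the ring.
        for i in range(self.tokens_per_node):
            pos = _hash(f"{node}#{i}")
            bisect.insort(self.ring, pos)
            self.owner[pos] = node

    def preference_list(self, key: str, n: int = 3):
        # Walk clockwise from the key's position, skipping virtual nodes that
        # map to a physical node already in the list, until N distinct nodes.
        start = bisect.bisect(self.ring, _hash(key)) % len(self.ring)
        nodes = []
        i = start
        while len(nodes) < n and len(nodes) < len(set(self.owner.values())):
            node = self.owner[self.ring[i % len(self.ring)]]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes


ring = HashRing()
for name in ["A", "B", "C", "D"]:
    ring.add_node(name)
print(ring.preference_list("shopping-cart:12345"))  # e.g. ['C', 'A', 'B']
```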

6.2 Issues Addressed


• Conflict Resolution: For many Amazon services, rejecting customer updates results in a poor customer
experience; the shopping cart service, for example, must allow customers to add and remove items even
amidst network and server failures. Hence the complexity of conflict resolution is pushed to the reads, in
order to ensure that writes are never rejected.
Secondly, while the data store itself can only apply simple policies such as “last write wins” for conflict
resolution, the application is aware of the data schema and can choose a resolution mechanism that best
satisfies the client; the shopping cart application, for instance, merges the diverging versions into a single
unified cart.
• Decentralization: In the past, centralized approaches resulted in several outages; to avoid this, a
decentralized approach is used, which leads to a simpler, more scalable and more available system and
maintains symmetry among the nodes.
• Latency Sensitivity: Dynamo is built for applications that require at least 99.9% of read and write
operations to be performed within a few hundred milliseconds. To meet these stringent latency require-
ments, each node maintains enough routing information locally to route a request to the appropriate
node directly.
• Vector Clock Size: Multiple server failures can lead to an object being written by nodes that are not
in the top N nodes of its preference list, causing the vector clock to grow. This is handled by adding a
timestamp field, so that each entry becomes a (node, counter, timestamp) triplet, where the timestamp
records the last time the node updated the object. When the number of triplets in an object's clock
reaches a maximum allowable value, the triplet with the oldest timestamp is removed (a small sketch of
this appears after this list).
• Ring Membership: In Amazon's environment, node outages are often transient, may last for extended
intervals, but rarely imply a permanent failure; they should therefore not result in re-balancing of the
partition assignment or repair of the unreachable replicas. Hence Dynamo uses an explicit mechanism to
initiate the addition and removal of nodes from a ring.
The gossip-based membership protocol discussed earlier can, however, result in a logically partitioned
Dynamo ring. Suppose the administrator adds node A to the ring and later adds node B; then nodes A
and B would each consider itself a member of the ring, yet neither would be immediately aware of the
other, leading to a logical partition of the key space.
To prevent logical partitions, some Dynamo nodes play the role of seeds, which are nodes discovered via
an external mechanism and known to all nodes. Because all nodes eventually reconcile their membership
with a seed, logical partitions are highly unlikely.
• Performance and Durability: A few customer-facing services required higher levels of performance
than the 99.9th-percentile targets allow with fully durable writes. For these, each write operation is
stored in an in-memory buffer and periodically written to storage by a writer thread; read operations
first check whether the requested key is present in the buffer to avoid the storage overhead.
This scheme trades durability for performance, as a server crash can lose writes that were still queued in
the buffer. To reduce this risk, Dynamo has the coordinator choose one of the N replicas to perform a
normal durable write; since the coordinator waits only for W responses, the performance of the write
operation is not limited by this single durable write.
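Below is a small sketch of the vector-clock operations described above: checking whether one version
descends from another, and truncating the clock once it exceeds a threshold by dropping the entry with the
oldest timestamp. The dictionary representation and threshold value are assumptions of this sketch.

```python
# Sketch of the vector-clock operations described above. Each clock is a
# dict of node -> (counter, timestamp); this representation is an assumption.
import time

MAX_ENTRIES = 10  # illustrative threshold


def descends(clock_a: dict, clock_b: dict) -> bool:
    """True if version A is an ancestor of (or equal to) version B."""
    return all(node in clock_b and counter <= clock_b[node][0]
               for node, (counter, _ts) in clock_a.items())


def update(clock: dict, node: str) -> dict:
    """Record a write coordinated by `node`, truncating the oldest entry if needed."""
    counter, _ = clock.get(node, (0, 0.0))
    new_clock = dict(clock)
    new_clock[node] = (counter + 1, time.time())
    if len(new_clock) > MAX_ENTRIES:
        oldest = min(new_clock, key=lambda n: new_clock[n][1])
        del new_clock[oldest]
    return new_clock


d1 = update({}, "Sx")
d2 = update(d1, "Sx")
d3 = update(d2, "Sy")
d4 = update(d2, "Sz")
print(descends(d2, d3))                      # True: D3 descends from D2
print(descends(d3, d4) or descends(d4, d3))  # False: D3 and D4 are in conflict
```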

Chapter 7

Conclusion

The designs of several diverse platforms deploying cloud computing have been studied in detail, and the
issues they face have been highlighted so that similar solutions can be reused for similar problems. Eucalyptus,
for instance, was developed for research environments and simply uses Linux clusters and server farms with
an easy, modular design. Applications that need to focus on delivering the best services to customers while
maximizing profit have the aforementioned architectures, which deliver services through service orientation
or guarantee QoS to customers while optimizing the vendor's profit. Applications requiring high availability
can use a design similar to Dynamo's, trading consistency for availability, while those focusing on performance
can deploy Google's simple MapReduce architecture. The Google File System, apart from detailing a system
that is scalable, fault tolerant and delivers high performance, also teaches us to observe customer behaviour
and requirements closely and to optimize the system to deliver the best possible service.

Bibliography

[1] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and
Dmitrii Zagorodnov. The Eucalyptus Open-source Cloud-computing System. In Proceedings of Cloud
Computing and Its Applications [online], October 2008.

[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman,
Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels. Dynamo: Amazon's
Highly Available Key-value Store. SOSP'07, October 14-17, 2007, Stevenson, Washington, USA.
[3] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.

[4] Jim (Zhanwen) Li, John Chinneck, Murray Woodside, Marin Litoiu, Gabriel Iszlai. Performance Model
Driven QoS Guarantees and Optimization. In CLOUD '09: Proceedings of the 2009 ICSE Workshop on
Software Engineering Challenges of Cloud Computing, pages 15-22.
[5] Karthik R, Diganta Goswami. An Open Cloud Architecture for Provision of IaaS. [Accepted]

[6] Liang-Jie Zhang and Qun Zhou. CCOA: Cloud Computing Open Architecture. IEEE International Con-
ference on Web Services, 2009.
[7] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. SOSP'03, October
19-22, 2003, Bolton Landing, New York, USA.

