
IJBSTR REVIEW PAPER VOL 1 [ISSUE 7] JULY 2013

ISSN 2320 6020

A Survey on Dynamic Replication Strategies for Improving Response Time in Data Grids
Ashish Kumar Singh1, Shashank Srivastava2 and Udai Shanker3
ABSTRACT: Replication is the process of creating exact copies of data to improve data availability, and it is well suited to distributed database systems. Dynamic replication helps reduce bandwidth consumption and access latency in large-scale systems such as data grids, and can thereby improve response time. Different replication strategies can be defined depending on when, where, and how replicas are created and destroyed. The data grid is a prime example of a distributed database system: a distributed collection of storage and computational resources that is not bounded within one geographical location. Whenever we deal with a data grid, we have to deal with geographical and temporal locality.
KEYWORDS: Data grid, Replication, DDBS, Scalability.
INTRODUCTION
In recent years, distributed databases have become an important area of information processing, and their importance continues to grow. In a distributed database, sites are interconnected through a network; to manage data in this fully interconnected environment, a method is needed so that data availability does not become an issue. Such an interconnected environment forms a data grid. A data grid is a collection of huge amounts of data located at multiple sites, where each site may retain its own administrative authority over who may access the data. Data replication is a method that addresses the problems of accessing data from a single server by improving performance and data availability. There are two principal replication schemes: active replication and passive replication. In a replicated environment, copies of data are hosted by multiple sites; increasing the number of copies, or replicas, enhances system performance by improving the locality of data. Work on data grids must account for temporal and geographical locality: temporal locality means that files popular in the past will be accessed more in the future, and geographical locality means that files recently accessed by a client are likely to be accessed by nearby clients.
Ashish Kumar Singh1, Shashank Srivastava2 and Udai
Shanker3
Department of Computer Science & Engineering,
Madan Mohan Malaviya Engineering College,
Gorakhpur-273 010
Email: ashi001.ipec@gmail.com1,
shashank07oct@gmail.com2 and udaigkp@gmail.com3

The main problems in file replication are availability, reliability, cost, scalability, throughput, network traffic, response time, and autonomous operation.
The purpose of a distributed file system is to allow
users of physically distributed computers to share data and
storage resources by using a common file system.
The main advantages of replication are [5]:
1. Improved availability: in case of a node failure, the system can retrieve the data from another site holding a replica, which keeps the data available.
2. Improved performance: since the data is replicated among several nodes, a user can obtain it from the nearest node or from the node that is best in terms of workload.
Data replication is very attractive for increasing system throughput and providing fault tolerance. However, keeping data copies consistent is a challenge.
LITERATURE REVIEW OF REPLICATION ALGORITHMS

Replication involves using specialized software that looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. The replication process can be complex and time-consuming depending on the size and number of the distributed databases, and it can require significant time and computing resources [14]. Several algorithms have therefore been proposed by different authors to address problems in the replication process; they are reviewed below.
Dynamic Group Protocol [1]
In this 1992 paper, the authors developed the Dynamic Group (DG) protocol, which adapts itself to changes in site availability and network connectivity, allowing it to tolerate n-2 successive replica failures. The protocol is designed to operate in distributed environments where some of the sites contain full replicas of data objects; these sites can fail and can be prevented from exchanging messages by failures of the communication subnet. The DG protocol achieves both fast access and high data availability. It organizes the replicas of a data object into small groups of equal size that are not necessarily disjoint; these groups correspond to the columns of the grid protocol, and the set of all groups for a given data object constitutes its group set. When the failure of a site is detected, the groups are dynamically rearranged to protect the availability of the replicated data object against subsequent failures. The number of groups may therefore decrease as sites fail, but the number of replicas in each group remains constant. The protocol defines two rules for maintaining the quorum:
Rule 1: Write Rule. The quorum for a write operation consists of one complete group of live replicas from the quorum set Q and one live replica from every other group in Q.

Rule 2: Read Rule. The quorum for a read operation consists of either one complete group of live replicas or one live replica from every group in Q.
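To make the two rules concrete, the following is a minimal Python sketch of the quorum tests over a hypothetical group set; the grid-column grouping and the liveness sets are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the DG quorum rules; the group layout and the
# liveness sets are illustrative assumptions, not the paper's code.

def has_write_quorum(quorum_set, live):
    """Rule 1: one complete group of live replicas from Q, plus one
    live replica from every other group in Q."""
    for i, group in enumerate(quorum_set):
        if all(r in live for r in group):              # a fully live group
            others = quorum_set[:i] + quorum_set[i + 1:]
            if all(any(r in live for r in g) for g in others):
                return True
    return False

def has_read_quorum(quorum_set, live):
    """Rule 2: one complete group of live replicas, or one live
    replica from every group in Q."""
    return (any(all(r in live for r in g) for g in quorum_set)
            or all(any(r in live for r in g) for g in quorum_set))

# Example: six replicas arranged as three groups of two (grid columns).
Q = [["r1", "r2"], ["r3", "r4"], ["r5", "r6"]]
print(has_write_quorum(Q, live={"r1", "r2", "r3", "r5"}))  # True
print(has_read_quorum(Q, live={"r2", "r4", "r6"}))         # True
```

Reads stay cheap because a single live replica per group suffices, while a write must pin down one full group.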
This is an efficient replication control protocol for managing replicated data objects that have more than five replicas. Like the grid protocol, the dynamic group protocol requires only O(n) messages per access to enforce mutual consistency among n replicas. It differs from the grid protocol by reorganizing itself every time it detects a change in the number of available sites or the connectivity of the network. As a result, it can tolerate n-2 successive replica failures and provides data availability comparable to that of the dynamic-linear voting protocol. As future work, more remains to be done to evaluate the impact of simultaneous failures and network partitions and to devise grouping strategies that minimize the likelihood that any such event could disable access to the replicated data.
A Pure Lazy Technique for Scalable Transaction
Processing in Replicated DBS [2]
In this 2005 paper, the authors presented a pure lazy technique. Replica synchronization can be classified into two categories: eager and lazy. In eager synchronization, all copies of a data item are updated by a single transaction; this approach runs into problems whenever any one of the replicas is unavailable, which is why the authors chose lazy synchronization.

Each update transaction executes at a primary site, while each read-only transaction executes at a secondary site. The performance of this algorithm is compared with that of the BLOCK algorithm and of algorithm ALG-1SR, which provides only global serializability (1SR) and not strong session 1SR. ALG-1SR provides no session guarantees and simply routes all update transactions to a primary site and all read-only transactions to a secondary site; it never blocks transactions, though they may be blocked by the local concurrency control at their execution site. ALG-1SR is an implementation of the DAG(WT) [6] protocol. The technique in this paper, a pure lazy approach, employs lexicographically ordered vectors to avoid transaction inversions and allows scalability of the primary database through partitioning and replication. The authors studied the performance of this algorithm and found that its cost is almost the same as that of 1SR, which does not prevent transaction inversions. They conclude that this solution is a viable technique for achieving scalability and preventing transaction inversions in lazy replicated database systems.
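The paper's own data structures are not reproduced here, but the routing idea can be sketched as follows, assuming one version-vector entry per partition and Python's built-in lexicographic list comparison; the class, names, and freshness rule are illustrative assumptions.

```python
# Illustrative sketch of lazy routing with lexicographically ordered
# vectors; names and the freshness test are assumptions for exposition.

class LazyRouter:
    """Send updates to the primary; send a read-only transaction to a
    secondary only if its applied vector is lexicographically at least
    the client's session vector, avoiding transaction inversions."""

    def __init__(self, num_partitions, secondaries):
        self.primary_vector = [0] * num_partitions      # state at primary
        # Applied vector per secondary site (updates propagate lazily).
        self.secondaries = {s: [0] * num_partitions for s in secondaries}

    def execute_update(self, partition):
        self.primary_vector[partition] += 1
        return list(self.primary_vector)        # client's session vector

    def route_read(self, session_vector):
        for site, vec in self.secondaries.items():
            if vec >= session_vector:           # Python lists compare
                return site                     # lexicographically
        return "primary"                        # all secondaries too stale

router = LazyRouter(num_partitions=2, secondaries=["s1", "s2"])
sv = router.execute_update(partition=0)         # client writes
router.secondaries["s1"] = [1, 0]               # lazy propagation applied
print(router.route_read(sv))                    # -> s1, fresh enough
```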
Dynamic Replica Management for Data Grid [3]
In this 2010 paper, the authors presented DRCPS (Dynamic Replica Creation, Placement, and Selection), an algorithm covering the dynamic creation and placement of replicas that can automatically maintain data according to its status. Replica creation decides which file should be replicated and how many replicas should be created, and it also reduces unnecessary replication. Replica placement puts replicas in appropriate locations so as to reduce placement cost; two factors are considered for placement, the response time and the bandwidth. With this algorithm, mean job execution time can be minimized and network usage is effective. It was implemented in OptorSim, a data grid simulator developed by the European DataGrid project. The authors use dynamic replication to conserve bandwidth in a hierarchical environment.

The replica selection procedure decides whether to take the replica from the master site, from the region header, or from a neighbouring site; the decision is based on a weighted Euclidean distance, as the sketch below illustrates. As future work, the authors plan to consider additional parameters for placing replicas; replica selection could be extended with further parameters such as security, and the dynamism of sites could be incorporated so that sites can join and leave the grid at any time.
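The following sketch ranks candidate sites by a weighted Euclidean distance over the two stated factors, response time and bandwidth; the weights, the bandwidth inversion, and the candidate values are assumptions, not the paper's parameters.

```python
import math

# Hypothetical weighted-Euclidean-distance replica selection over the
# two factors the paper names; weights and scaling are assumptions.

def select_replica(candidates, w_rt=0.6, w_bw=0.4):
    """candidates: (site, response_time_ms, bandwidth_mbps) tuples.
    Lower response time and higher bandwidth are better, so bandwidth
    is inverted before measuring the distance to the ideal point."""
    def distance(rt, bw):
        return math.sqrt(w_rt * rt ** 2 + w_bw * (1.0 / bw) ** 2)
    return min(candidates, key=lambda c: distance(c[1], c[2]))

sites = [("master", 120.0, 100.0),
         ("region_header", 40.0, 50.0),
         ("neighbour", 15.0, 20.0)]
print(select_replica(sites)[0])   # picks the site closest to the ideal
```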
Research on Replication in Unstructured P2P Networks [4]
The research in this paper is based on subdividable areas. The author proposes a new mechanism called the Junction Replication Method (JRM) for unstructured decentralized P2P networks, which reflect the most common kind of P2P network in real environments. A request threshold is set: if a normal node's requests exceed that threshold, a replica is sent to it directly. In this way, nodes are kept from becoming overloaded.
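A minimal sketch of that threshold rule, assuming a per-(node, file) request counter kept at the super node; the limit value and record layout are illustrative.

```python
# Hypothetical sketch of JRM's threshold rule: once a normal node's
# requests for a file exceed the limit, push it a replica directly.
from collections import defaultdict

REQUEST_LIMIT = 5                                # assumed threshold

class SuperNode:
    def __init__(self):
        self.request_counts = defaultdict(int)   # (node, file) -> count
        self.replica_holders = defaultdict(set)  # file -> nodes with copy

    def handle_request(self, node, file_id):
        self.request_counts[(node, file_id)] += 1
        if (self.request_counts[(node, file_id)] > REQUEST_LIMIT
                and node not in self.replica_holders[file_id]):
            self.replica_holders[file_id].add(node)
            return f"replica of {file_id} pushed to {node}"
        return f"served {file_id} to {node}"

sn = SuperNode()
for _ in range(6):
    msg = sn.handle_request("n42", "fileA")
print(msg)   # replica of fileA pushed to n42 (limit crossed)
```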
The paper still leaves room for improvement. First, the security of JRM needs to be considered, since node locations are easily exposed. Second, as computers develop, the gap between super nodes and normal nodes is shrinking; methods are needed to spread replicas evenly and reduce the pressure on super nodes. Better load balancing across the whole network is the ultimate goal.
The FLARE Strategy [9]
In this 2012 paper, the authors presented a dynamic data replication strategy on flash-SSD-assisted VoD servers, based on a proposed hybrid storage architecture for a large-scale Video-on-Demand system. In this strategy, customers are divided into regional networks, and each regional network has one or more server clusters known as units. In the VoD server architecture, the videos are primarily stored on hard disks, and replicas of the popular videos are stored on flash SSDs. In FLARE, read requests for the first ten minutes of popular videos that have been replicated are served by flash SSD. The number of HDD servers in the hybrid storage system is much larger than the number of flash SSD servers; the flash SSD servers and HDD servers together form a cluster, and each flash SSD is of a fixed size. The performance of FLARE was evaluated by comparing it with the conventional disk-striping strategy for Video-on-Demand servers. The flash SSDs are arranged in an array in an SSD node, and various SSD parameters can be modified, such as the page size, the number of disks used, and the number of planes. It is observed that an increase in the number of requests greatly improves the relative performance: the difference between the total time taken to serve the requests by FLARE and by sequentially striped HDD nodes grows with the number of nodes.
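The routing rule itself is simple enough to sketch: requests falling in the first ten minutes of a popular, replicated video go to flash SSD, and everything else goes to HDD. The popularity set and the offsets below are illustrative; only the ten-minute prefix rule comes from the paper.

```python
# Sketch of FLARE's read-routing rule; video names and offsets are
# invented, only the ten-minute prefix rule is taken from the paper.

SSD_PREFIX_SECONDS = 10 * 60            # first ten minutes of a video

def route_read(video_id, offset_seconds, ssd_replicated):
    """Return the storage tier that serves this read request."""
    if video_id in ssd_replicated and offset_seconds < SSD_PREFIX_SECONDS:
        return "flash_ssd"              # popular prefix lives on SSD
    return "hdd"                        # everything else: striped HDDs

popular = {"movie_1", "movie_7"}        # videos with replicated prefixes
print(route_read("movie_1", 45, popular))    # flash_ssd
print(route_read("movie_1", 900, popular))   # hdd (past the prefix)
print(route_read("movie_3", 45, popular))    # hdd (never replicated)
```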
In particular, FLARE improved the performance of the I/O system by up to 23% compared with the traditional striped HDD system, and it consistently performs better than the conventional HDD disk-striping technique. In the future, this work could be extended to show that energy can also be optimized by using flash SSD nodes alongside HDD nodes.
Impact of Peer-to-Peer Communication on Real-Time Performance of File Replication Algorithms [11]
In this study, the performance of four replication algorithms, two from the literature and two new ones, is evaluated. For this evaluation, a process-oriented, discrete-event-driven simulator was developed. A detailed set of simulation studies was conducted using the simulator, and the results are presented to elaborate on the real-time performance of these replication algorithms.

Refinement is still needed to obtain more efficient algorithms: these initial yet detailed results on the impact of peer-to-peer communication on the real-time grid performance of replication algorithms motivate the development of more sophisticated algorithms that make better use of grid resources, which will be the topic of future research.



PHFS (Predictive Hierarchical Fast Spread) [10]
In this paper, the authors designed a replication technique, PHFS, used to decrease the latency of data access. It is an extension of Fast Spread, presented by Ranganathan et al. [12]. The algorithm uses predictive techniques to forecast the future usage of files and then pre-replicates them in a hierarchical data grid along the path from source to client. It works in two phases: in the first phase, it builds file-access logs by collecting access information from all over the system; in the second, it applies data-mining techniques such as clustering and association-rule mining to extract useful knowledge, for example clusters of files that are accessed together or the most frequent sequential access patterns. From this, a predictive working set (PWS) is formed, so that whenever a client requests a file, PHFS finds the PWS of that file and replicates all members of the PWS, along with the requested file, on the path from source to client.
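A minimal sketch of the PWS step, assuming the mining phase has already produced a mapping from each file to its predictive working set; the mapping, the path model, and the replicate callback are illustrative assumptions.

```python
# Hypothetical sketch of PHFS pre-replication; the PWS table and the
# path are placeholders standing in for the paper's mined patterns.

pws = {                                  # output of the mining phase
    "fileA": {"fileB", "fileC"},         # files usually accessed together
    "fileB": {"fileA"},
}

def handle_request(file_id, path_to_client, replicate):
    """Replicate the requested file plus its PWS along the path."""
    working_set = {file_id} | pws.get(file_id, set())
    for node in path_to_client:          # tiers between source and client
        for f in sorted(working_set):
            replicate(f, node)

handle_request("fileA", ["region", "site", "client_cache"],
               replicate=lambda f, n: print(f"replicating {f} at {n}"))
```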
It has been noticed that the PHFS method is most suitable for applications in which clients keep working in the same context for a long period and whose requests are not random; it is therefore well suited to scientific applications in which researchers are working on a project.
LALW (Latest Access Largest Weight) [13]
In this 2008 paper, the authors presented a dynamic weighted data replication scheme for data grids: an algorithm named LALW that selects a popular file for replication, calculates a suitable number of copies to satisfy the current requirements of the network, and selects grid sites to hold them. By setting a different weight for each data-access record, the importance of each record is differentiated: records from the recent past carry higher weight, indicating higher reference value, while records from the distant past have lower reference value. The algorithm is based on a hierarchical architecture with a dynamic replication policymaker (DRP) at the center of the structure, responsible solely for deciding where to replicate data. A cluster header manages the site information within each cluster, and the policymaker collects information about accessed files from all headers. Each site maintains a detailed record for each file, stored in the form <timestamp, FileId, ClusterId>; each cluster header also maintains a record of the form <FileId, ClusterId, Number>; and the DRP maintains records of the form <FileId, Number>. At constant time intervals, the policymaker collects file information from all cluster headers. Information gathered at different time intervals is given different weights to distinguish the importance of historical records; the rule for setting the weights uses the concept of half-life, as the sketch below illustrates. Each cluster has a cluster header that maintains local information, while the DRP maintains global information.
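A minimal sketch of half-life weighting of access records, assuming a decay base of 2 so that each interval back in time halves a record's weight; the per-interval access counts are invented for illustration.

```python
# Hypothetical sketch of LALW's half-life weighting; access counts are
# made up, and base 2 is the classic half-life choice discussed below.

def weighted_popularity(access_history, base=2.0):
    """access_history: per-file access counts, newest interval last.
    A record k intervals old is weighted by base ** -k."""
    scores = {}
    for file_id, counts in access_history.items():
        n = len(counts)
        scores[file_id] = sum(c * base ** -(n - 1 - i)
                              for i, c in enumerate(counts))
    return max(scores, key=scores.get), scores

history = {"fileA": [50, 10, 5],    # once popular, now fading
           "fileB": [5, 20, 40]}    # same total hits, but recent
popular, scores = weighted_popularity(history)
print(popular, scores)               # fileB wins on recency
```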
The performance of this algorithm was measured with the grid simulator OptorSim, which provides a modular framework for simulating a real data grid environment. The simulation results show that LALW achieves an average job execution time similar to that of the LFU optimizer but excels in terms of effective network usage. As future work, the authors intend to further reduce job execution time, for which two factors must be considered. One is the length of a time interval: if it is too short, the data-access history carries too little information; if it is too long, the information may be outdated and useless. The other is the base of the exponential decay: if the base is larger than 1 but smaller than 2, the weights decay more slowly, so data-access history contributes more when identifying the popular files.




Dynamic Replication Based on Availability and Popularity [5]
In this paper, the authors proposed a dynamic replication algorithm based on two parameters, the availability and the popularity of data in the data grid, following a hierarchical architecture. The root of the hierarchical topology binds the different clusters together, so the longest distance between two clusters is two hops. The root holds a list of the replicas existing in the system; on each cluster, if a request arrives that cannot be satisfied within the cluster, the Cluster-Head (CH) forwards it to the root, which in turn directs it to a cluster holding a copy of the data. Availability and popularity are managed through the concept of a primary copy, marked by a Boolean variable D(i): if D(i) = False, the replica was created by a cluster head and cannot be deleted; if D(i) = True, the replica was created by a node and can be deleted. The node with the smallest degree of responsibility and good stability is the best responsible node, while the node with the greatest number of accesses to a given piece of data is the best client. When a node receives a request to store a new replica, it reacts according to the type of replication: replication based on availability targets the best responsible node, and replication based on popularity targets the best client node.
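A minimal sketch of that placement decision, assuming simple numeric scores for responsibility, stability, and access counts; the node records and the tie-breaking are illustrative assumptions.

```python
# Hypothetical sketch of the availability/popularity placement rule;
# the node metrics below are invented for illustration.

nodes = [
    {"id": "n1", "responsibility": 2, "stability": 0.90, "accesses": 3},
    {"id": "n2", "responsibility": 5, "stability": 0.70, "accesses": 12},
    {"id": "n3", "responsibility": 1, "stability": 0.95, "accesses": 1},
]

def place_replica(nodes, replication_type):
    if replication_type == "availability":
        # Best responsible node: smallest responsibility, then best
        # stability as the tie-breaker.
        return min(nodes, key=lambda n: (n["responsibility"],
                                         -n["stability"]))
    # Popularity: the best client, i.e. most accesses to this data.
    return max(nodes, key=lambda n: n["accesses"])

print(place_replica(nodes, "availability")["id"])   # n3
print(place_replica(nodes, "popularity")["id"])     # n2
```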
To validate the proposed approach, the authors developed the FTSIM simulator in Java. Through this simulator they verified that the approach achieves good response time in comparison with no replication. The conclusion is that the system must guarantee a minimum level of data availability and then improve availability according to the demand for the data. As future work, this could be extended by introducing intelligence into the decision-making and by integrating agents within each cluster to implement cooperative distributed intelligence among the agents.

Comparison of dynamic replication strategies

Table 1 compares the dynamic replication algorithms discussed above on the basis of different parameters, such as availability, reliability, scalability, type of management, network traffic, response time, and throughput. Different algorithms give a good response to the user on different parameters; among them all, the LALW algorithm still leaves considerable work to be done on improving system throughput, so from a research point of view LALW is a good candidate for extending this work.

Table 1: Comparison of different dynamic replication algorithms
CONCLUSION
In this paper, we have reviewed different file replication algorithms on the basis of different parameters. The workings of each algorithm were discussed, along with its planned future work and its simulation results for the parameters chosen in each study. A comparison of the different algorithms was given with respect to various parameters such as availability, reliability, scalability, throughput, network traffic, and response time. In our future work, we intend to increase the throughput, decrease the response time, and improve the availability, scalability, and reliability of large distributed databases; this work can also be carried out on the data grid.



REFERENCES
1. Jehan-François Pâris and Perry Kim Sloope: Dynamic Management of Highly Replicated Data, IEEE, 1992.

2. Khuzaima Daudjee and Kenneth Salem: A Pure Lazy Technique for Scalable Transaction Processing in Replicated Databases, Proceedings of the 11th International Conference on Parallel and Distributed Systems (ICPADS'05), IEEE, 2005.

3. K. Sashi and Antony Selvadoss Thanamani: Dynamic replication in a data grid using a Modified BHR Region Based Algorithm, Future Generation Computer Systems, 27(2): 202-210, 2011.

4. H. Sato, S. Matsuoka and T. Endo: File clustering based replication algorithm in a grid environment, 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009.

5. Bakhta Meroufel and Ghalem Belalem: Dynamic Replication Based on Availability and Popularity in the Presence of Failures, Journal of Information Processing Systems (JIPS), Vol. 8, No. 2, pp. 263-278, June 2012.

6. Y. Breitbart, R. Komondoor, R. Rastogi, S. Seshadri and A. Silberschatz: Update propagation protocols for replicated databases, Proceedings of SIGMOD, pp. 97-108, 1999.

7. G. F. Hughes, J. F. Murray and K. K. Delgoda: Improved disk drive failure warnings, IEEE Transactions on Reliability, 51(3): 350-357, 2002.

8. S. Venugopal, R. Buyya and K. Ramamohanarao: A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing, ACM Computing Surveys, 38(1): 1-53, 2006.

9. Ramya Manjunath and Tao Xie: Dynamic Data Replication on Flash SSD Assisted Video-on-Demand Servers, IEEE, 2012.

10. L. M. Khanli, A. Isazadeh and T. N. Shishavan: PHFS: A dynamic replication method to decrease access latency in multi-tier data grid, Future Generation Computer Systems, 2010.

11. M. Atanak and A. Dogan: Impact of peer to peer communication on real time performance of file replication algorithms, 23rd International Symposium on Computer and Information Sciences, Oct. 2008.

12. K. Ranganathan and I. Foster: Design and evaluation of dynamic replication strategies for a high performance data grid, International Conference on Computing in High Energy and Nuclear Physics, 2001.

13. Ruay-Shiung Chang, Hui-Ping Chang and Yun-Ting Wang: A Dynamic Weighted Data Replication Strategy in Data Grids, pp. 414-421, IEEE, 2008.

14. http://en.wikipedia.org/wiki/Distributed_database

15. A. Tamir: An O(pn^2) algorithm for the p-median and related problems on tree graphs, Operations Research Letters, vol. 19, pp. 59-64, 1996.

16. W. Hoschek, F. J. Jaen-Martinez, A. Samar, H. Stockinger and K. Stockinger: Data management in an international data grid project, Proceedings of the GRID Workshop, pp. 77-90, 2000.
