Due to their nature, Eolic parks are situated in zones with troublesome access. Hence,
management of Eolic parks using remote sensing techniques is of great importance. Furthermore,
the large quantity of information managed by Eolic parks, along with its nature (distributed,
heterogeneous, produced and consumed at different times, etc.), makes them an ideal scenario
for big data techniques. In this paper, we present a multilayer hardware/software
architecture that applies cloud computing techniques for managing massive data from Eolic
parks. This design makes it possible to tackle the processing of huge, distributed, and
heterogeneous data sets in a remote sensing context. An innovative contribution of this work is
the combination of different techniques at three layers of the proposed hardware/software
architecture for Eolic park big data management and processing.
1. INTRODUCTION
With the continuing explosion of distributed data, huge treasures hidden in it are
waiting to be explored for valuable insights. For example, social web sites such as
Facebook can uncover usage patterns and hidden correlations by analyzing web site history
records (e.g., click records and activity records) to detect hot social events and support
marketing decisions (e.g., advertisement recommendation), and the Square Kilometre Array
(SKA) [1], an international project to build the world’s largest telescope distributed over several
countries, needs to fuse geographically dispersed data for scientific applications. However,
due to properties such as large volume, high complexity, and dispersion of big data,
coupled with the scarcity of wide-area bandwidth (e.g., trans-oceanic links), it is inefficient
and/or infeasible to process the data with centralized solutions [2]. This has driven major
companies from industry (e.g., Facebook, Google, and Microsoft) to deploy multi-datacenter
clouds and hybrid clouds. These cloud technologies offer a powerful and cost-effective way to
deal with the increasingly high velocity of big data generated from geo-distributed sources.
1.2 OBJECTIVE
In this work, a framework is proposed that systematically handles the issues of
data movement, resource provisioning, and reducer selection in the context of running
MapReduce across multiple datacenters with VMs of different types and dynamic prices. The
complex cost optimization problem is formulated as a joint stochastic integer nonlinear
optimization problem and solved using the Lyapunov optimization framework, by
transforming the original problem into three independent subproblems (data movement, resource
provisioning, and reducer selection) that admit simple solutions. We design an
efficient, distributed online algorithm, MiniBDP, that minimizes the long-term time-averaged
operation cost. We analyze the performance of MiniBDP in terms of cost optimality and
worst-case delay, showing that the algorithm approximates the optimal solution within provable
bounds and guarantees that data processing completes within pre-defined delays. We
conduct extensive experiments with real-world datasets to evaluate the performance of the
online algorithm. The experimental results demonstrate its effectiveness as well as its
superiority, in terms of cost, system stability, and decision-making time, over existing
representative approaches (e.g., combinations of data allocation strategies (proximity-aware,
load-balance-aware) and resource provisioning strategies (stable, heuristic)).
2. LITERATURE SURVEY
Authors : M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,
Year : 2010.
MapReduce and its variants have been highly successful in implementing large-scale
data-intensive applications on commodity clusters. However, most of these systems are built
around an acyclic data flow model that is not suitable for other popular applications. This paper
focuses on one such class of applications: those that reuse a working set of data across multiple
parallel operations. This includes many iterative machine learning algorithms, as well as
interactive data analysis tools. We propose a new framework called Spark that supports these
applications while retaining the scalability and fault tolerance of MapReduce. To achieve these
goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a
read-only collection of objects partitioned across a set of machines that can be rebuilt if a
partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can
be used to interactively query a 39 GB dataset with sub-second response time.
Year : 2013.
Cloud computing, rapidly emerging as a new computation paradigm, provides agile and
scalable resource access in a utility-like fashion, especially for the processing of big data. An
important open issue here is to efficiently move the data, from different geographical locations
over time, into a cloud for effective processing. The de facto approach of hard drive shipping is
not flexible or secure. This work studies timely, cost-minimizing upload of massive,
dynamically-generated, geo-dispersed data into the cloud, for processing using a MapReduce-
like framework. Targeting a cloud encompassing disparate data centers, we model a cost-
minimizing data migration problem and propose two online algorithms: an online lazy migration
(OLM) algorithm and a randomized fixed horizon control (RFHC) algorithm, for optimizing at
any given time the choice of the data center for data aggregation and processing, as well as the
routes for transmitting data there. Careful comparisons among these online and offline
algorithms in realistic settings are conducted through extensive experiments, which demonstrate
close-to-offline-optimum performance of the online algorithms.
Year : 2015.
As new data and updates are constantly arriving, the results of data mining applications
become stale and obsolete over time. Incremental processing is a promising approach to refresh
mining results. It utilizes previously saved states to avoid the expense of re-computation from
scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to
MapReduce. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-
value pair level incremental processing rather than task level re-computation, (ii) supports not
only one-step computation but also more sophisticated iterative computation, and (iii)
incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain
computation states. Experimental results on Amazon EC2 show significant performance
improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-
computation.
Year : 2014
MapReduce has proven remarkably effective for a wide variety of data-intensive applications,
but it was designed to run on large single-site homogeneous clusters. Researchers have begun to
explore the extent to which the original MapReduce assumptions can be relaxed, including
skewed workloads, iterative applications, and heterogeneous computing environments. This
paper continues this exploration by applying MapReduce across geo-distributed data over geo-
distributed computation resources. Using Hadoop, we show that network and node heterogeneity
and the lack of data locality lead to poor performance, because the interaction of MapReduce
phases becomes pronounced in the presence of heterogeneous network behavior. To address
these problems, we take a two-pronged approach: We first develop a model-driven optimization
that serves as an oracle, providing high-level insights. We then apply these insights to design
cross-phase optimization techniques that we implement and demonstrate in a real-world
MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the
potential of these techniques as performance is improved by 7-18 percent depending on the
execution environment and application.
3. SYSTEM ANALYSIS
Existing System
Due to their nature, Eolic parks are commonly situated in rural zones or in the open sea,
which imposes problems in terms of accessibility. In this context, data acquisition and
management (which is often subject to some kind of high-performance computing due to the
massive volumes of data involved) is generally performed by means of remote sensing
techniques. This fact, together with the great volume of data to be processed, the distributed
nature of such data, the heterogeneity of the data sources, and the different times of data
production (synchronous and asynchronous), together with the different data structures involved
(some structured but most of them not structured), and the fact that each node of the system can
act as both producer and consumer, make Eolic parks an ideal scenario for the application of big
data techniques for remote sensing. According to common definitions, mainly three aspects
characterize big data: 1) the data are numerous; 2) the data cannot be categorized into regular
relational databases; and 3) the data are generated, captured, and processed rapidly. The data
collected from Eolic parks conform to these characteristics; hence, the application of big remote
sensing data processing techniques can provide an important asset for the management of Eolic
parks as a relevant source of renewable energy.
Proposed System
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS
• System : Intel i3 processor
• Hard Disk : 300 GB
• Floppy Drive : 1.44 MB
• Monitor : 15" VGA colour
• Mouse : Logitech
• RAM : 4 GB
SOFTWARE REQUIREMENTS
SYSTEM DESIGN
SCOPE AND PURPOSE
The scope of the system design describes any constraints (referencing any trade-off
analyses conducted, such as resource use versus productivity, or conflicts with other systems)
and includes any assumptions made by the project team in developing the system design.
It also provides any contingencies that might arise in the design of the system that may
change the development direction. Possibilities include lack of interface agreements with
outside agencies or unstable architectures at the time this document is produced. Any
possible workarounds or alternative plans are also addressed.
OBJECTIVE
The Objective of System Design describes the system requirements, operating
environment, system and subsystem architecture, files and database design, input formats, output
layouts, human-machine interfaces, detailed design, processing logic, and external interfaces.
SYSTEM ARCHITECTURE
The system architecture describes the overall system software and organization. It includes
a list of software modules (functions, subroutines, or classes), computer languages,
and programming and computer-aided software engineering tools (with a brief description of the
function of each item). Structured organization diagrams/object-oriented diagrams show
the various segmentation levels down to the lowest level. All features on the diagrams
have reference numbers and names, and a narrative expands on and enhances the
understanding of the functional breakdown. Where appropriate, subsections address each
module.
SEQUENCE DIAGRAM
The sequence diagrams are an easy and intuitive way of describing the system’s behavior,
which focuses on the interaction between the system and the environment. This notational
diagram shows the interaction arranged in a time sequence. The sequence diagram has two
dimensions: the vertical dimension represents time, and the horizontal dimension represents
different objects. The vertical line, also called the object’s lifeline, represents the object’s
existence during the interaction.
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies, and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests, each addressing a specific testing requirement.
UNIT TESTING
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application; it is done after the completion of an individual unit and before integration. This is
structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
INTEGRATION TESTING
The task of integration testing is to check that components or software applications
(e.g., components in a software system or, one step up, software applications at the company
level) interact without error.
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
SYSTEM IMPLEMENTATION
Implementation is the most crucial stage in achieving a successful system and giving the
users confidence that the new system is workable and effective. Implementation here means
deploying a modified application to replace an existing one. This type of conversion is relatively
easy to handle, provided there are no major changes in the system.
Each program is tested individually at the time of development using sample data, and it
is verified that the programs link together in the way specified in the program specifications. The
computer system and its environment are tested to the satisfaction of the user before the system
is implemented. A simple operating procedure is included so that the user can understand the
different functions clearly and quickly.
Implementation is the stage of the project when the theoretical design is turned into a
working system. It can thus be considered the most critical stage in achieving a successful
new system and in giving the user confidence that the new system will work and be effective.
The implementation stage involves careful planning, investigation of the existing system and its
constraints on implementation, and the design of methods to achieve the changeover.
SOFTWARE ENVIRONMENT
PURPOSE:
The concept of Big Data has been around for more than a decade – but while its potential
to transform the effectiveness, efficiency, and profitability of virtually any enterprise has
been well documented, the means to effectively leverage Big Data and realize its
promised benefits still eludes some organizations. Ultimately, there are two main hurdles
to tackle when it comes to realizing these benefits.
The first is realizing that the real purpose of leveraging Big Data is to take action – to
make more accurate decisions and to do so quickly. We call this situational awareness.
Regardless of industry or environment, situational awareness means having an
understanding of what you need to know, what you have control of, and conducting
analysis in real-time to identify anomalies in normal patterns or behaviors that can affect
the outcome of a business or process. If you have these things, making the right decision
within the right amount of time in any context becomes much easier.
Defining these parameters for any industry is not simple, and thus surmounting Big
Data’s other remaining challenge of creating new approaches to data management and
analysis is also no small feat. Achieving situational awareness used to be much easier
because data volumes were smaller, and new data was created at a slower rate, which
meant our world was defined by a much smaller amount of information. But new data is
now created at an exponential rate, and therefore any data management and
analysis system that is built to provide situational awareness today must also be able to
do so tomorrow. So, the imperative for any enterprise is not just to create systems that
manage Big Data and provide situational awareness, but to build systems that
provide scalable situational awareness.
Take, for instance, the utilities industry. This space is in particular need of scalable
situational awareness so that utilities can realize benefits for a wide range of important
functions critical for enabling Smart Grid paradigms. A properly functioning power grid
network shifts power around to where it is needed. Scalable situational awareness for
utilities then means knowing where power is needed, and where it can be taken from, to
keep the grid stable. When power flow is not well understood its direction will start
changing rapidly, moving energy around like a power hurricane.
As with any hurricane, at the middle there is an eye that is totally quiet and dark (a fitting,
although ironic, analogy considering the goal of awareness). This is what happened in
2003, during one of the worst blackouts to hit the Northeast. The various power
companies involved were quickly analyzing all of the information but, despite the fact
that they were all communicating, they didn’t know what exactly to do to alleviate the
drastic shift in power flow and, thus, ended up making the wrong decisions that resulted
in the blackout.
If situational awareness had been present, the blackout could have been prevented. This
seems especially relevant given the recent blackout in India, and begs the question: is the
U.S. aware enough of the potential dangers that we have taken steps to enable our smart
grid to respond in the correct way to avoid such outages?
Utilities in the U.S. and beyond can learn much about how to achieve scalable situational
awareness from other industries, most notably building management and
telecommunications, which have learned to deal with Big Data’s complexity and scale
well. For industries like utilities to achieve scalable situational awareness, it requires
building standards-based, interoperable, and scalable data management systems.
SCOPE:
1. Visual data discovery tools will be growing 2.5 times faster than rest of the Business
Intelligence (BI) market. By 2018, investing in this enabler of end-user self-service will become
a requirement for all enterprises.
2. Over the next five years spending on cloud-based Big Data and analytics (BDA) solutions will
grow three times faster than spending for on-premise solutions. Hybrid on/off premise
deployments will become a requirement.
3. Shortage of skilled staff will persist. In the U.S. alone there will be 181,000 deep analytics
roles in 2018 and five times that many positions requiring related skills in data management and
interpretation.
4. By 2017 unified data platform architecture will become the foundation of BDA strategy. The
unification will occur across information management, analysis, and search technology.
5. 70% of large organizations already purchase external data and 100% will do so by 2019. In
parallel, more organizations will begin to monetize their data by selling them or providing value-
added content.
6. Decision management platforms will expand at a CAGR of 60% through 2019 in response to
the need for greater consistency in decision making and decision-making process knowledge
retention.
7. Rich media (video, audio, image) analytics will at least triple in 2015 and emerge as the key
driver for BDA technology investment.
8. By 2018, half of all consumers will interact with services based on cognitive computing on a
regular basis.
Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The
amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If
you pile up the data in the form of disks, it may fill an entire football field. The same amount was
created every two days in 2011, and every ten minutes in 2013. This rate is still growing
enormously. Though all this information is meaningful and can be useful when processed, much
of it is being neglected.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. Big data is not merely data; rather, it has become a complete subject,
involving various tools, techniques, and frameworks.
Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
Black Box Data : It is a component of helicopter, airplanes, and jets, etc. It captures voices of
the flight crew, recordings of microphones and earphones, and the performance information of
the aircraft.
Social Media Data : Social media such as Facebook and Twitter hold information and the views
posted by millions of people across the globe.
Stock Exchange Data : The stock exchange data holds information about the ‘buy’ and ‘sell’
decisions made on a share of different companies made by the customers.
Power Grid Data : The power grid data holds information consumed by a particular node with
respect to a base station.
Transport Data : Transport data includes model, capacity, distance and availability of a vehicle.
Search Engine Data : Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data
in it will be of three types: structured, semi-structured, and unstructured.
Hadoop is an Apache Software Foundation project that importantly provides two things: a
distributed filesystem (HDFS) and a computation framework (MapReduce).
HDFS
HDFS is structured similarly to a regular Unix filesystem, except that data storage
is distributed across several machines. It is not intended as a replacement for a regular
filesystem, but rather as a filesystem-like layer for large distributed systems to use. It has
built-in mechanisms to handle machine outages, and it is optimized for throughput rather than latency.
Datanode - where HDFS actually stores the data, there are usually quite a few of these.
Namenode - the ‘master’ machine. It controls all the metadata for the cluster, e.g., what blocks
make up a file, and which datanodes those blocks are stored on.
Secondary Namenode - this is NOT a backup namenode, but is a separate service that keeps a
copy of both the edit logs, and filesystem image, merging them periodically to keep the size
reasonable.
(This is soon being deprecated in favor of the backup node and the checkpoint node, but the
functionality remains similar, if not the same.)
Data can be accessed using either the Java API, or the Hadoop command line client. Many
operations are similar to their Unix counterparts. Check out the documentation page for the full
list, but here are some simple examples:
hadoop fs -ls /    # list the root directory of HDFS
hadoop fs -ls ./   # list the working (home) directory in HDFS
Note that HDFS is optimized differently than a regular file system. It is designed for non-
realtime applications demanding high throughput instead of online applications demanding low
latency. For example, files cannot be modified once written, and the latency of reads/writes is
really bad by filesystem standards. On the flip side, throughput scales fairly linearly with the
number of datanodes in a cluster, so it can handle workloads no single machine would ever be
able to.
HDFS also has a bunch of unique features that make it ideal for distributed systems:
Failure tolerant - data can be duplicated across multiple datanodes to protect against machine
failures. The industry standard seems to be a replication factor of 3 (everything is stored on three
machines).
Scalability - data transfers happen directly with the datanodes so your read/write capacity scales
fairly well with the number of datanodes
Space - need more disk space? Just add more datanodes and re-balance
Industry standard - Lots of other distributed applications build on top of HDFS (HBase, Map-
Reduce)
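As a quick illustration of what a replication factor of 3 means for capacity planning, here is a small sketch (the helper function below is hypothetical, not part of Hadoop):

```python
def usable_capacity_tb(datanodes, disk_tb_per_node, replication=3):
    """Usable HDFS capacity: raw cluster storage divided by the replication factor."""
    raw_tb = datanodes * disk_tb_per_node
    return raw_tb / replication

# e.g., 30 datanodes with 10 TB of disk each, at the default replication of 3
print(usable_capacity_tb(30, 10))  # -> 100.0
```

This is also why "just add more datanodes" works: both raw and usable capacity grow linearly with the node count.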
HDFS Resources
For more information about the design of HDFS, you should read through the Apache
documentation page. In particular, the streaming and data access section has some really simple
and informative diagrams on how data reads/writes actually happen.
MapReduce
The second fundamental part of Hadoop is the MapReduce layer. This is made up of two
subcomponents: the map phase and the reduce phase. The canonical word-count example looks
like this in pseudocode:

map(line):
  line.split.foreach{word =>
    output(word, 1)
  }

reduce(word, counts):
  var total = 0L
  counts.foreach{count =>
    total += count
  }
  output(word, total)
Notice that the output of a map and reduce task is always a KEY, VALUE pair. You
always output exactly one key and one value. The input to a reduce is KEY,
ITERABLE[VALUE]. Reduce is called exactly once for each key output by the map phase.
The ITERABLE[VALUE] is the set of all values output by the map phase for that key.
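To make the KEY, VALUE contract concrete, here is a framework-free sketch of the phases in Python (function names are illustrative; real Hadoop jobs implement Mapper/Reducer classes in Java):

```python
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # the framework groups all values by key between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # reduce: called exactly once per key, with all values for that key
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(result["the"])  # -> 2
```

Note how reduce_phase never sees raw lines, only a key and the iterable of values the shuffle collected for it.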
Counterintuitively, one of the most important parts of a MapReduce job is what
happens between map and reduce; there are three other stages: partitioning, sorting, and grouping. In
the default configuration, the goal of these intermediate steps is to ensure this behavior: that the
values for each key are grouped together, ready for the reduce() function. APIs are also provided
if you want to tweak how these stages work (for example, to perform a secondary sort).
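For instance, the default partitioning step can be approximated as a stable hash of the key modulo the number of reducers (a sketch analogous to, not identical to, Hadoop's HashPartitioner):

```python
import zlib

def default_partition(key, num_reducers):
    # a stable hash of the key modulo the reducer count, so that all
    # values for the same key are routed to the same reduce task
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# equal keys always land on the same reducer
print(default_partition("apple", 4) == default_partition("apple", 4))  # -> True
```

Overriding this mapping (or the sort/group comparators) is exactly what the tweaking APIs mentioned above allow.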
Here’s a diagram of the full workflow to try and demonstrate how these pieces all fit
together, but really at this stage it’s more important to understand how map and reduce interact
rather than understanding all the specifics of how that is implemented.
What’s really powerful about this API is that there is no dependency between any two
instances of the same task. To do its job, a map() task does not need to know about the other map
tasks, and similarly a single reduce() task has all the context it needs to aggregate any particular
key; it does not share any state with other reduce tasks.
Taken as a whole, this design means that the stages of the pipeline can be easily
distributed to an arbitrary number of machines. Workflows requiring massive datasets can be
easily distributed across hundreds of machines because there are no inherent dependencies
between the tasks requiring them to be on the same machine.
If you want to learn more about MapReduce (generally, and within Hadoop) I
recommend you read the Google MapReduce paper, the Apache MapReduce documentation, or
maybe even the hadoop book. Performing a web search for MapReduce tutorials also offers a lot
of useful information.
To make things more interesting, many projects have been built on top of the MapReduce
API to ease the development of MapReduce workflows. For example Hive lets you write SQL to
query data on HDFS instead of Java. There are many more examples, so if you’re interested in
learning more about these frameworks, I’ve written a separate article about the most common
ones.
Hadoop MapReduce comes with two primary services for scheduling and running
MapReduce jobs. They are the Job Tracker (JT) and the Task Tracker (TT). Broadly speaking
the JT is the master and is in charge of allocating tasks to task trackers and scheduling these tasks
globally. A TT is in charge of running the Map and Reduce tasks themselves.
When running, each TT registers itself with the JT and reports the number of ‘map’ and
‘reduce’ slots it has available. The JT keeps a central registry of these across all TTs and allocates
them to jobs as required. When a task is completed, the TT re-registers that slot with the JT and
the process repeats.
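The slot bookkeeping described above can be modeled with a toy class (an illustrative model only, not Hadoop's actual API):

```python
class ToyJobTracker:
    """Toy model of the JobTracker's central registry of task-tracker slots."""

    def __init__(self):
        self.free_slots = {}  # task tracker id -> free slots

    def register(self, tt_id, slots):
        # each TT reports how many slots it has available
        self.free_slots[tt_id] = slots

    def allocate(self):
        # assign a task to any tracker with a free slot, if one exists
        for tt_id, free in self.free_slots.items():
            if free > 0:
                self.free_slots[tt_id] -= 1
                return tt_id
        return None

    def task_done(self, tt_id):
        # a completed task re-registers its slot with the JT
        self.free_slots[tt_id] += 1

jt = ToyJobTracker()
jt.register("tt-1", 1)
print(jt.allocate())  # -> tt-1
print(jt.allocate())  # -> None (no free slots until a task completes)
```

The real JT adds scheduling policy, locality, and fault handling on top of this basic register/allocate/re-register cycle.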
Many things can go wrong in a big distributed system, so these services have some clever tricks
to ensure that your job finishes successfully:
Automatic retries - if a task fails, it is retried N times (usually 3) on different task trackers.
Data locality optimizations - if you co-locate a TT with an HDFS Datanode (which you should),
it will take advantage of data locality to make reading the data faster
Blacklisting a bad TT - if the JT detects that a TT has too many failed tasks, it will blacklist it.
No tasks will then be scheduled on this task tracker.
Speculative Execution - the JT can schedule the same task to run on several machines at the
same time, just in case some machines are slower than others. When one version finishes, the
others are killed.
Here’s a simple diagram of a typical deployment with TTs deployed alongside datanodes.
For more reading on the JobTracker and TaskTracker check out Wikipedia or the Hadoop
book. I find the apache documentation pretty confusing when just trying to understand these
things at a high level, so again doing a web-search can be pretty useful.
Wrap Up
I hope this introduction to Hadoop was useful. There is a lot of information on-line, but I
didn’t feel like anything described Hadoop at a high-level for beginners.
The Hadoop project is a good deal more complex and deep than I have represented and is
changing rapidly. For example, an initiative called MapReduce 2.0 provides a more general
purpose job scheduling and resource management layer called YARN, and there is an ever
growing range of non-MapReduce applications that run on top of HDFS, such as Cloudera
Impala.
CONSTRAINTS:
This section considers big data processing constraints on a low-power Hadoop cluster. Big data
processing with Hadoop has been emerging recently, both on the computing cloud and in
enterprise deployments. However, widespread security exploits may hurt the reputation of public
clouds. If Hadoop on the cloud is not an option, an organization has to build its own Hadoop
clusters. But having a data center is not worthwhile for a small organization, in terms of both
building and operating costs. Another viable solution is to build a cluster with low-cost ARM
system-on-chip boards.
BIG DATA APPLICATIONS:
Big Data is slowly becoming ubiquitous. Every arena of business, health, or general living
standards can now implement big data analytics. To put it simply, Big Data is a field that can be
used in any zone whatsoever, given that this large quantity of data can be harnessed to one‘s
advantage. The major areas of Big Data are listed below.
Organizations worldwide are slowly but steadily recognizing the importance of big
data analytics. From predicting customer purchasing behavior patterns, to influencing customers
to make purchases, to detecting fraud and misuse (which until very recently was an intractable
task for most companies), big data analytics is a one-stop solution. Business
experts should have the opportunity to question and interpret data according to their business
requirements irrespective of the complexity and volume of the data. In order to achieve this
requirement, data scientists need to efficiently visualize and present this data in a comprehensible
manner. Giants like Google, Facebook, Twitter, EBay, Wal-Mart etc., adopted data visualization
to ease complexity of handling data. Data visualization has shown immense positive outcomes in
such business organizations. Implementing data analytics and data visualization, enterprises can
finally begin to tap into the immense potential that Big data possesses and ensure greater return
on investments and business stability.
Healthcare is one of the arenas in which big data ought to have the greatest social
impact. From diagnosing potential health hazards in an individual to complex medical research,
big data is present in every aspect of it. Devices such as the Fitbit, Jawbone, and Samsung Gear
Fit allow users to track and upload health data. Soon such data will be compiled and made
available to doctors, aiding them in diagnosis.
MODULES
1. Stream processing frameworks
In wide-area data analytics, data is generated in a geo-distributed fashion, and new
constraints, such as privacy concerns, must be considered. Stream processing handles one data
element, or a small batch of elements, at a time, and data are processed immediately upon
arrival. For stream processing, computations are relatively simple and independent, and it
benefits from low latency, typically on the order of seconds. Latency here is the delay between
receiving a request and generating the response. Datacenters are built in a geographically
distributed manner precisely to achieve low latency for local users. Nevertheless, as data
volumes keep growing at a tremendous rate, transferring such substantial amounts of data across
datacenters remains time-consuming. Many cloud services have very stringent latency
requirements; even a delay of one second can make a great difference.
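The process-on-arrival behavior described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not any particular stream framework's API; the hot-search-word scenario and the example words are hypothetical.

```python
from collections import Counter

def stream_counts(stream):
    """Process each element immediately upon arrival, yielding the
    current hottest word after every single element is consumed."""
    counts = Counter()
    for word in stream:
        counts[word] += 1               # simple, independent update
        yield counts.most_common(1)[0]  # (word, count) seen most so far

# Simulated stream of search words arriving one at a time.
events = ["storm", "hadoop", "storm", "spark", "storm"]
for hottest in stream_counts(events):
    print(hottest)
```

Because each element is handled the moment it arrives, the per-element latency is just the cost of one counter update, which is what gives stream processing its seconds-or-less response times.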
Distributed execution is a strategy widely used in cost-aware big data processing
across datacenters. The strategy pushes computation down to the local datacenters and then
aggregates the intermediate results for further processing. We use a motivating example to show
the high-level idea of this strategy. A social network provider wants to compute the hot search
words every ten minutes. Click logs and search logs are the two input data sources: click logs
store web server records of user activities, and search logs record user requests for information.
The base data is born distributed across datacenters, and the goal is to choose an execution
strategy that minimizes data traffic between datacenters. With the centralized execution shown in
Fig. 4, data traffic across datacenters is 600 GB per day. However, with the distributed
execution strategy depicted, the data become much smaller after pre-processing in the local
datacenters, and cross-datacenter traffic drops to only 5 GB per day.
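The push-down-then-aggregate idea can be sketched as follows. The datacenter names and log contents are hypothetical stand-ins; the point is that only the compact per-datacenter counts cross datacenter boundaries, never the raw logs.

```python
from collections import Counter

# Hypothetical per-datacenter search logs; the base data is born
# distributed across datacenters, as in the hot-search-word example.
logs = {
    "dc-us": ["storm", "spark", "storm", "hadoop"],
    "dc-eu": ["spark", "spark", "storm"],
}

def local_preprocess(words):
    """Runs inside each datacenter: reduce raw logs to compact counts."""
    return Counter(words)

# Only these small intermediate results are shipped across datacenters.
intermediate = [local_preprocess(words) for words in logs.values()]

# Central aggregation step: merge the pre-aggregated partial counts.
global_counts = Counter()
for partial in intermediate:
    global_counts.update(partial)

print(global_counts.most_common(2))  # the hot search words
```

Shipping a handful of (word, count) pairs instead of every raw log line is exactly what reduces cross-datacenter traffic from hundreds of gigabytes to a few gigabytes per day in the example above.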
Hadoop is a batch processing framework in which the data to be processed are stored in
HDFS, a distributed file system designed to manage large datasets with high fault tolerance.
MapReduce, the heart of Hadoop, is a programming model for processing substantial amounts of
data in parallel. A MapReduce job has three major processing phases: Map, Shuffle, and Reduce.
A traditional relational database organizes data into rows and columns and stores the data in
tables; MapReduce takes a different approach and operates on key/value pairs.
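The three phases and the key/value representation can be illustrated with a minimal single-process word-count sketch. This is a toy stand-in for a real Hadoop job, not the Hadoop API itself:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) key/value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values of each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big clusters", "data processing"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'processing': 1}
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data between them over the network; the key/value structure is what makes that partitioning possible, in contrast to the row-and-table model of a relational database.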
CONCLUSION
Thus, in this paper a methodical framework for effective data movement, resource
provisioning, and reducer selection with the goal of cost minimization has been developed. We
balance five types of cost between the two MapReduce phases across datacenters: bandwidth
cost, storage cost, computing cost, migration cost, and latency cost. This complex cost
optimization problem is formulated as a joint stochastic integer nonlinear optimization problem
that minimizes the five cost factors simultaneously. By employing the Lyapunov technique, we
transform the original problem into three independent subproblems, which are solved by an
efficient online algorithm, MiniBDP, that minimizes the long-term time-averaged operation cost.
We conduct theoretical analysis to demonstrate the effectiveness of MiniBDP in terms of cost
optimality and worst-case delay, and we perform an experimental evaluation on a real-world
trace dataset to validate the theoretical results and the superiority of MiniBDP over existing
typical approaches and offline methods. The proposed approach is expected to have widespread
application prospects in globally operating companies, since analyzing geographically dispersed
datasets is an efficient way to support their marketing decisions. Because the subproblems in
MiniBDP have analytical or efficient solutions that allow the algorithm to run in an online
manner, the proposed approach can be readily implemented in a real system to reduce operation
cost.
REFERENCES
[1] “Square kilometre array,” http://www.skatelescope.org/.
[3] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[6] M. Cardosa, C. Wang, A. Nangia et al., “Exploring mapreduce efficiency with highly-
distributed data,” in Proceedings of the second international workshop on MapReduce and its
applications, 2011.
[7] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau, “Moving big data to the cloud:
An online cost-minimizing approach,” IEEE Journal on Selected Areas in Communications, vol.
31, pp. 2710–2721, 2013.
[8] W. Yang, X. Liu, L. Zhang, and L. T. Yang, “Big data real-time processing based on Storm,”
in Proceedings of the IEEE TrustCom’13, 2013.
[9] Y. Zhang, S. Chen, Q. Wang, and G. Yu, “i2MapReduce: Incremental MapReduce for mining
evolving big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 1906–
1919, 2015.
[10] D. Lee, J. S. Kim, and S. Maeng, “Large-scale incremental processing with mapreduce,”
Future Generation Computer Systems, vol. 36, no. 7, pp. 66–79, 2014.
[11] B. Heintz, A. Chandra, R. K. Sitaraman, and J. Weissman, “End-to-end optimization for geo-
distributed MapReduce,” IEEE Transactions on Cloud Computing, 2014.
[12] C. Jayalath, J. Stephen, and P. Eugster, “From the cloud to the atmosphere: Running
MapReduce across data centers,” IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87,
2014.
[13] P. Li, S. Guo, S. Yu, and W. Zhuang, “Cross-cloud MapReduce for big data,” IEEE
Transactions on Cloud Computing, 2015, doi: 10.1109/TCC.2015.2474385.
[14] A. Sfrent and F. Pop, “Asymptotic scheduling for many task computing in big data
platforms,” Information Sciences, vol. 319, pp. 71–91, 2015.
[15] L. Zhang, Z. Li, C. Wu, and M. Chen, “Online algorithms for uploading deferrable big data
to the cloud,” in Proceedings of the IEEE INFOCOM, 2014, pp. 2022–2030.
[18] K. Kloudas, M. Mamede, N. Preguica, and R. Rodrigues, “Pixida: Optimizing data parallel
jobs in wide-area data analytics,” Proceedings of the VLDB Endowment, vol. 9, no. 2, pp. 72–83,
2015.
[19] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, “Low
latency geo-distributed data analytics,” in Proceedings of the ACM SIGCOMM’15, 2015.
[21] J. C. Corbett, J. Dean, M. Epstein et al., “Spanner: Google’s globally distributed database,”
in Proceedings of the OSDI’12, 2012.
[25] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely, “Dynamic resource allocation and
power management in virtualized data centers,” in Proceedings of the IEEE NOMS, 2010, pp.
479–486.
[26] F. Liu, Z. Zhou, H. Jin, B. Li, B. Li, and H. Jiang, “On arbitrating the power-performance
tradeoff in saas clouds,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 10,
pp. 2648–2658, 2014.
[27] Y. Yao, L. Huang, A. Sharma, L. Golubchik, and M. Neely, “Power cost reduction in
distributed data centers: A two-time-scale approach for delay tolerant workloads,” IEEE
Transactions on Parallel and Distributed Systems, vol. 25, no. 1, pp. 200–211, 2014.
[28] D. Wu, Z. Xue, and J. He, “icloudaccess: Cost-effective streaming of video games from the
cloud with low latency,” IEEE Transactions on Circuits and Systems for Video Technology, vol.
24, no. 8, pp. 1405– 1416, 2014.
[29] W. Xiao, W. Bao, X. Zhu, C. Wang, L. Chen, and L. T. Yang, “Dynamic request redirection
and resource provisioning for cloud-based video services under heterogeneous environment,”
IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1954–1967, 2016.
[30] S. Rao, R. Ramakrishnan, A. Silberstein et al., “Sailfish: a framework for large scale data
processing,” in Proceedings of the Third ACM Symposium on Cloud Computing, 2012.
[32] M. J. Neely, “Opportunistic scheduling with worst case delay guarantees in single and
multi-hop networks,” in Proceedings of the IEEE INFOCOM, 2011, pp. 1728–1736.
[33] M. Arlitt and T. Jin, “A workload characterization study of the 1998 world cup web site,”
IEEE Network, vol. 14, no. 3, pp. 30–37, 2000.