
ABSTRACT

Due to their nature, Eolic parks are situated in zones with difficult access. Hence, the
management of Eolic parks using remote sensing techniques is of great importance. Further, the
large quantity of data managed by Eolic parks, along with its nature (distributed, heterogeneous,
created and consumed at different times, etc.), makes them an ideal scenario for big data
techniques. In this paper, we present a multilayer hardware/software architecture that applies
cloud computing techniques to the management of massive data from Eolic parks. This
architecture allows tackling the processing of large, distributed, and heterogeneous data sets in a
remote sensing context. An innovative contribution of this work is the combination of different
techniques at the three layers of the proposed hardware/software architecture for Eolic park big
data management and processing.

1. INTRODUCTION

1.1 ABOUT THE PROJECT

With the explosion of distributed data, huge treasures hidden in it are waiting to be
explored for valuable insights. To illustrate, social web sites such as Facebook can uncover usage
patterns and hidden correlations by analyzing web site history records (e.g., click records and
activity records) to detect hot social events and support marketing decisions (e.g., advertisement
recommendation), while the Square Kilometre Array (SKA) [1], an international project to build
the world’s largest telescope distributed over several countries, needs to fuse geographically
dispersed data for scientific applications. However, due to properties of big data such as large
volume, high complexity, and geographic dispersion, coupled with the scarcity of wide-area
bandwidth (e.g., trans-oceanic links), it is inefficient and/or infeasible to process the data with
centralized solutions [2]. This has driven major industry players to deploy multi-datacenter clouds
and hybrid clouds. These cloud technologies offer a powerful and cost-effective solution for
dealing with the increasingly high velocity of big data generated from geo-distributed sources
(e.g., Facebook, Google, and Microsoft).

Processing big data across geo-distributed datacenters has continued to gain popularity in
recent years. However, managing distributed MapReduce computations across geo-distributed
datacenters poses a number of technical challenges: how to allocate data among a selection of
geo-distributed datacenters to reduce the communication cost, how to determine the VM (Virtual
Machine) provisioning strategy that offers high performance and low cost, and what criteria
should be used to select a datacenter as the final reducer for big data analytics jobs. In this paper,
these challenges are addressed by balancing bandwidth cost, storage cost, computing cost,
migration cost, and latency cost between the two MapReduce phases across datacenters. We
formulate this complex cost optimization problem for data movement, resource provisioning, and
reducer selection as a joint stochastic integer nonlinear optimization problem that minimizes the
five cost factors simultaneously.
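
Written schematically (the notation below is introduced only to make the structure of the objective concrete and is not taken verbatim from the paper), the problem takes the form

\min \; \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\big[ C_{bw}(t) + C_{st}(t) + C_{cp}(t) + C_{mg}(t) + C_{lat}(t) \big]

subject to the data movement, VM provisioning, and reducer selection decisions at each time slot t, where the five terms denote the bandwidth, storage, computing, migration, and latency costs, respectively.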

1.2 OBJECTIVE

In this work, a framework is proposed that can systematically handle the issues of
data movement, resource provisioning, and reducer selection in the context of running
MapReduce across multiple datacenters, with VMs of different types and dynamic prices. The
complex cost optimization problem is formulated as a joint stochastic integer nonlinear
optimization problem and solved using the Lyapunov optimization framework by transforming
the original problem into three independent subproblems (data movement, resource provisioning,
and reducer selection), each of which admits a simple solution. We design an efficient and
distributed online algorithm, MiniBDP, that minimizes the long-term time-averaged operation
cost. We analyze the performance of MiniBDP in terms of cost optimality and worst-case delay,
and show that the algorithm approximates the optimal solution within provable bounds and
guarantees that data processing completes within pre-defined delays. Extensive experiments with
real-world datasets are conducted to evaluate the performance of the online algorithm. The
experimental results demonstrate its effectiveness as well as its superiority, in terms of cost,
system stability, and decision-making time, over existing representative approaches (e.g.,
combinations of data allocation strategies (proximity-aware, load-balance-aware) and resource
provisioning strategies (stable strategy, heuristic strategy)).
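
For readers unfamiliar with Lyapunov optimization, the decomposition relies on the standard drift-plus-penalty technique (the approach of [31]); the sketch below is generic, and the exact queue definitions used by MiniBDP are those given in the paper. With a queue backlog vector \Theta(t) and Lyapunov function L(\Theta(t)) = \tfrac{1}{2}\sum_i Q_i(t)^2, the algorithm greedily minimizes, at every time slot t, an upper bound on

\Delta(\Theta(t)) + V \, \mathbb{E}\big[ C(t) \mid \Theta(t) \big], \quad \text{where } \Delta(\Theta(t)) = \mathbb{E}\big[ L(\Theta(t+1)) - L(\Theta(t)) \mid \Theta(t) \big],

so that the control parameter V > 0 trades a larger queue backlog (and hence worst-case delay) against a time-averaged cost C(t) closer to the optimum.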
2. LITERATURE SURVEY
Authors : M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica,

Title : “Spark: Cluster computing with working sets,”

Year : 2010.

MapReduce and its variants have been highly successful in implementing large-scale
data-intensive applications on commodity clusters. However, most of these systems are built
around an acyclic data flow model that is not suitable for other popular applications. This paper
focuses on one such class of applications: those that reuse a working set of data across multiple
parallel operations. This includes many iterative machine learning algorithms, as well as
interactive data analysis tools. We propose a new framework called Spark that supports these
applications while retaining the scalability and fault tolerance of MapReduce. To achieve these
goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a
read-only collection of objects partitioned across a set of machines that can be rebuilt if a
partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can
be used to interactively query a 39 GB dataset with sub-second response time.
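
As a small illustration of the working-set idea, the sketch below caches a dataset once and reuses it across several parallel passes. It uses Spark's Java API, to stay consistent with the Java code elsewhere in this report; the input path, the filter condition, and the local-mode master are placeholders and are not taken from the cited paper.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WorkingSetExample {
    public static void main(String[] args) {
        // Local mode is used here only so the sketch can run without a cluster.
        SparkConf conf = new SparkConf().setAppName("working-set-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the working set once and keep it in memory across iterations (the RDD abstraction).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt").cache();

        // Reuse the cached RDD in several parallel operations, e.g. iterative passes.
        for (int i = 0; i < 10; i++) {
            long hits = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("iteration " + i + ": " + hits);
        }
        sc.stop();
    }
}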

Authors : M. Cardosa, C. Wang, A. Nangia et al.,

Title : “Exploring mapreduce efficiency with highly-distributed data,”

Year : 2011.

MapReduce is a highly popular paradigm for high-performance computing over large
data sets in large-scale platforms. However, when the source data is widely distributed and the
computing platform is also distributed, e.g. data is collected in separate data center locations, the
most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial.
In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for
situations when data and compute resources are widely distributed. Further, we provide
recommendations for alternative (and even hierarchical) distributed MapReduce setup
configurations, depending on the workload and data set.
Authors : L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau,

Title : Moving big data to the cloud: An online cost-minimizing approach

Year : 2013.

Cloud computing, rapidly emerging as a new computation paradigm, provides agile and
scalable resource access in a utility-like fashion, especially for the processing of big data. An
important open issue here is to efficiently move the data, from different geographical locations
over time, into a cloud for effective processing. The de facto approach of hard drive shipping is
not flexible or secure. This work studies timely, cost-minimizing upload of massive,
dynamically-generated, geo-dispersed data into the cloud, for processing using a MapReduce-
like framework. Targeting a cloud encompassing disparate data centers, we model a cost-
minimizing data migration problem and propose two online algorithms: an online lazy migration
(OLM) algorithm and a randomized fixed horizon control (RFHC) algorithm, for optimizing at
any given time the choice of the data center for data aggregation and processing, as well as the
routes for transmitting data there. Careful comparisons among these online and offline
algorithms in realistic settings are conducted through extensive experiments, which demonstrate
close-to-offline-optimum performance of the online algorithms.

Authors : Y. Zhang, S. Chen, Q. Wang, and G. Yu,

Title : “i2mapreduce: Incremental mapreduce for mining evolving big data,”

Year : 2015.

As new data and updates are constantly arriving, the results of data mining applications
become stale and obsolete over time. Incremental processing is a promising approach to refresh
mining results. It utilizes previously saved states to avoid the expense of re-computation from
scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to
MapReduce. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-
value pair level incremental processing rather than task level re-computation, (ii) supports not
only one-step computation but also more sophisticated iterative computation, and (iii)
incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain
computation states. Experimental results on Amazon EC2 show significant performance
improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-
computation.

Authors : B. Heintz, A. Chandra, R. K. Sitaraman, and J. Weissman

Title : End-to-end optimization for geo-distributed mapreduce

Year : 2014

MapReduce has proven remarkably effective for a wide variety of data-intensive applications,
but it was designed to run on large single-site homogeneous clusters. Researchers have begun to
explore the extent to which the original MapReduce assumptions can be relaxed, including
skewed workloads, iterative applications, and heterogeneous computing environments. This
paper continues this exploration by applying MapReduce across geo-distributed data over geo-
distributed computation resources. Using Hadoop, we show that network and node heterogeneity
and the lack of data locality lead to poor performance, because the interaction of MapReduce
phases becomes pronounced in the presence of heterogeneous network behavior. To address
these problems, we take a two-pronged approach: We first develop a model-driven optimization
that serves as an oracle, providing high-level insights. We then apply these insights to design
cross-phase optimization techniques that we implement and demonstrate in a real-world
MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the
potential of these techniques as performance is improved by 7-18 percent depending on the
execution environment and application.
3. SYSTEM ANALYSIS
Existing System

Due to their nature, Eolic parks are commonly situated in rural zones or in the open sea,
which poses problems in terms of accessibility. In this context, data acquisition and management
(which is often subject to some kind of high performance computing due to the massive volumes
of data involved) is generally performed by means of remote sensing techniques. This fact,
together with the great volume of data to be processed, the distributed nature of such data, the
heterogeneity of the data sources, the different times of data production (synchronous and
asynchronous), the different data structures involved (some structured but most of them
unstructured), and the fact that each node of the system can act as both producer and consumer,
makes Eolic parks an ideal scenario for the application of big data techniques for remote sensing.
According to the definition provided in the literature, mainly three aspects characterize big data:
1) the data are numerous; 2) the data cannot be categorized into regular relational databases; and
3) the data are generated, captured, and processed rapidly. The data collected from Eolic parks
conform to these characteristics; hence, the application of big remote sensing data processing
techniques can provide an important asset for the management of Eolic parks as a relevant source
of renewable energy.

Proposed System

In this paper, we develop a new multilayer software/hardware architecture for controlling
and monitoring Eolic parks. This architecture, which is applied to a case study, allows tackling
the processing of large, distributed, and heterogeneous (regarding source, format, and production
velocity) data sets in a remote sensing context. In the presented case study, we define three
different levels for data management: wind turbine, substation, and control center. An important
contribution (from a big data processing perspective) is given by the different strategies that are
specifically adopted in each layer. While the wind turbine level makes use of low-cost devices
for optimizing data management and transfer, the control center layer makes use of advanced
features such as cloud computing provided by the Amazon Web Services (AWS) infrastructure.

SYSTEM REQUIREMENTS:

HARDWARE REQUIREMENTS

• System : Intel Core i3 processor
• Hard Disk : 300 GB
• Floppy Drive : 1.44 MB
• Monitor : 15-inch VGA colour
• Mouse : Logitech
• RAM : 4 GB

SOFTWARE REQUIREMENTS

• Operating system : Windows XP/Ubuntu


• Coding Language : Java (for Mapper and Reducer)
• Front End : PHP, JavaScript (Intelligent Graph)
• Back End : Hadoop cluster
• Tool : Oracle VirtualBox

SYSTEM DESIGN
SCOPE AND PURPOSE
The scope of the system design is that it describes any constraints (referencing any trade-off
analyses conducted, such as resource use versus productivity, or conflicts with other systems)
and includes any assumptions made by the project team in developing the system design.

It also provides any contingencies that might arise in the design of the system that may
change the development direction. Possibilities include lack of interface agreements with
outside agencies or unstable architectures at the time this document is produced. Possible
workarounds or alternative plans are also addressed.

OBJECTIVE
The Objective of System Design describes the system requirements, operating
environment, system and subsystem architecture, files and database design, input formats, output
layouts, human-machine interfaces, detailed design, processing logic, and external interfaces.

STRUCTURE OF DESIGN DOCUMENT

SYSTEM ARCHITECTURE

This section describes the overall system software and organization. It includes a list of
software modules (functions, subroutines, or classes), computer languages, and computer-aided
software engineering (CASE) tools, with a brief description of the function of each item.
Structured organization diagrams/object-oriented diagrams show the various segmentation levels
down to the lowest level. All features on the diagrams have reference numbers and names, and a
narrative expands on and enhances the understanding of the functional breakdown. Where
appropriate, subsections address each module.
USE CASE DIAGRAM

A use-case diagram is a graph of actors, a set of use cases enclosed by a system
boundary, participation associations between the actors and the use cases, and generalizations
among the use cases. In general, the use-case diagram defines the outside (actors) and inside
(use cases) of the system’s typical behavior. A use case is shown as an ellipse containing the
name of the use case and is initiated by actors. An actor is anything that interacts with a use
case; it is symbolized by a stick figure with the name of the actor below the figure.
CLASS DIAGRAM

In software engineering, a class diagram in the Unified Modeling Language (UML) is a
type of static structure diagram that describes the structure of a system by showing the
system's classes, their attributes, operations (or methods), and the relationships among objects.
SEQUENCE DIAGRAM

Sequence diagrams are an easy and intuitive way of describing the system’s behavior,
focusing on the interaction between the system and its environment. This notational
diagram shows the interactions arranged in a time sequence. The sequence diagram has two
dimensions: the vertical dimension represents time, and the horizontal dimension represents
different objects. The vertical line, also called the object’s lifeline, represents the object’s
existence during the interaction.
STATE DIAGRAM

State diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration, and concurrency. In the Unified Modeling
Language, state diagrams are intended to model both computational and organizational
processes.
SYSTEM TESTING

The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies, and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests, and each test type addresses a specific testing requirement.

UNIT TESTING

Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application; it is done after the completion of an individual unit and before integration. This is
structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at the component level and test a specific business process, application, and/or
system configuration. Unit tests ensure that each unique path of a business process performs
accurately to the documented specifications and contains clearly defined inputs and expected results.

Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.

Test strategy and approach


Field testing will be performed manually and functional tests will be written in detail.
Test objectives
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages, and responses must not be delayed.
Features to be tested
• Verify that the entries are in the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
INTEGRATION TESTING

Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects.

The task of the integration test is to check that components or software applications (e.g.,
components in a software system or, one step up, software applications at the company level)
interact without error.

Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.

SYSTEM IMPLEMENTATION

Implementation is the most crucial stage in achieving a successful system and giving the
users confidence that the new system is workable and effective. It covers the implementation of a
modified application to replace an existing one. This type of conversion is relatively easy to
handle, provided there are no major changes in the system.

Each program is tested individually at the time of development using test data, and it is
verified that the programs link together in the way specified in the program specifications. The
computer system and its environment are tested to the satisfaction of the user, and the system can
then be implemented. A simple operating procedure is included so that the user can understand
the different functions clearly and quickly.

Implementation is the stage of the project when the theoretical design is turned into a
working system. Thus, it can be considered the most critical stage in achieving a successful
new system and in giving the user confidence that the new system will work and be effective.
The implementation stage involves careful planning, investigation of the existing system and its
constraints on implementation, and the design of methods to achieve the changeover.
SOFTWARE ENVIRONMENT

PURPOSE:

• The concept of Big Data has been around for more than a decade – but while its potential
to transform the effectiveness, efficiency, and profitability of virtually any enterprise has
been well documented, the means to effectively leverage Big Data and realize its
promised benefits still eludes some organizations. Ultimately, there are two main hurdles
to tackle when it comes to realizing these benefits.
• The first is realizing that the real purpose of leveraging Big Data is to take action – to
make more accurate decisions and to do so quickly. We call this situational awareness.
Regardless of industry or environment, situational awareness means having an
understanding of what you need to know, what you have control of, and conducting
analysis in real-time to identify anomalies in normal patterns or behaviors that can affect
the outcome of a business or process. If you have these things, making the right decision
within the right amount of time in any context becomes much easier.
• Defining these parameters for any industry is not simple, and thus surmounting Big
Data’s other remaining challenge of creating new approaches to data management and
analysis is also no small feat. Achieving situational awareness used to be much easier
because data volumes were smaller, and new data was created at a slower rate, which
meant our world was defined by a much smaller amount of information. But new data is
now created at a hugely exponential rate, and therefore any data management and
analysis system that is built to provide situational awareness today must also be able to
do so tomorrow. So, the imperative for any enterprise is not to just create systems that
manage Big Data and provide situational awareness, but to build systems that
provide scalable situational awareness.
• Take, for instance, the utilities industry. This space is in particular need of scalable
situational awareness so that utilities can realize benefits for a wide range of important
functions critical for enabling Smart Grid paradigms. A properly functioning power grid
network shifts power around to where it is needed. Scalable situational awareness for
utilities then means knowing where power is needed, and where it can be taken from, to
keep the grid stable. When power flow is not well understood, its direction will start
changing rapidly, moving energy around like a power hurricane.

• As with any hurricane, at the middle there is an eye that is totally quiet and dark (a fitting,
although ironic, analogy considering the goal of awareness). This is what happened in
2003, during one of the worst blackouts to hit the Northeast. The various power
companies involved were quickly analyzing all of the information but, despite the fact
that they were all communicating, they didn’t know what exactly to do to alleviate the
drastic shift in power flow and, thus, ended up making the wrong decisions that resulted
in the blackout.
• If situational awareness had been present, the blackout could have been prevented. This
seems especially relevant given the recent blackout in India, and begs the question: is the
U.S. aware enough of the potential dangers that we have taken steps to enable our smart
grid to respond in the correct way to avoid such outages?
• Utilities in the U.S. and beyond can learn much about how to achieve scalable situational
awareness from other industries, most notably building management and
telecommunications, which have learned to deal with Big Data’s complexity and scale
well. For industries like utilities to achieve scalable situational awareness, it requires
building standards-based, interoperable, and scalable data management systems.

SCOPE:

The Scope of the Big data and Analytics are:

1. Visual data discovery tools will be growing 2.5 times faster than the rest of the Business
Intelligence (BI) market. By 2018, investing in this enabler of end-user self-service will become
a requirement for all enterprises.

2. Over the next five years spending on cloud-based Big Data and analytics (BDA) solutions will
grow three times faster than spending for on-premise solutions. Hybrid on/off premise
deployments will become a requirement.
3. Shortage of skilled staff will persist. In the U.S. alone there will be 181,000 deep analytics
roles in 2018 and five times that many positions requiring related skills in data management and
interpretation.

4. By 2017 unified data platform architecture will become the foundation of BDA strategy. The
unification will occur across information management, analysis, and search technology.

5. Growth in applications incorporating advanced and predictive analytics, including machine
learning, will accelerate in 2015. These apps will grow 65% faster than apps without predictive
functionality.

6. 70% of large organizations already purchase external data and 100% will do so by 2019. In
parallel more organizations will begin to monetize their data by selling them or providing value-
added content.

7. Adoption of technology to continuously analyze streams of events will accelerate in 2015 as it
is applied to Internet of Things (IoT) analytics, which is expected to grow at a five-year
compound annual growth rate (CAGR) of 30%.

8. Decision management platforms will expand at a CAGR of 60% through 2019 in response to
the need for greater consistency in decision making and decision making process knowledge
retention.

9. Rich media (video, audio, image) analytics will at least triple in 2015 and emerge as the key
driver for BDA technology investment.

10. By 2018, half of all consumers will interact with services based on cognitive computing on a
regular basis.

OVERVIEW OF HADOOP-BIG DATA:

Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The
amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If
you pile up the data in the form of disks, it may fill an entire football field. The same amount was
created every two days in 2011, and every ten minutes in 2013. This rate is still growing
enormously. Though all this information produced is meaningful and can be useful when
processed, it is being neglected.

What is Big Data?

Big data, as the name suggests, is a collection of large datasets that cannot be
processed using traditional computing techniques. Big data is not merely data; rather, it has
become a complete subject, which involves various tools, techniques, and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.

Black Box Data : It is a component of helicopters, airplanes, jets, etc. It captures the voices of
the flight crew, recordings of microphones and earphones, and the performance information of
the aircraft.

Social Media Data : Social media such as Facebook and Twitter hold information and the views
posted by millions of people across the globe.

Stock Exchange Data : The stock exchange data holds information about the ‘buy’ and ‘sell’
decisions made by customers on the shares of different companies.

Power Grid Data : The power grid data holds information about the power consumed by a
particular node with respect to a base station.

Transport Data : Transport data includes model, capacity, distance and availability of a vehicle.

Search Engine Data : Search engines retrieve lots of data from different databases.

Thus, Big Data includes huge volume, high velocity, and an extensible variety of data. The data
in it will be of three types:

Structured data : Relational data.

Semi Structured data : XML data.


Unstructured data : Word, PDF, Text, Media Logs.

GENERAL DESCRIPTION OF HADOOP:

Hadoop is an Apache Software Foundation project that importantly provides two things:

A distributed filesystem called HDFS (Hadoop Distributed File System)

A framework and API for building and running MapReduce jobs

HDFS

HDFS is structured similarly to a regular Unix filesystem except that data storage
is distributed across several machines. It is not intended as a replacement for a regular
filesystem, but rather as a filesystem-like layer for large distributed systems to use. It has built-in
mechanisms to handle machine outages, and is optimized for throughput rather than latency.

There are two and a half types of machine in an HDFS cluster:

Datanode - where HDFS actually stores the data; there are usually quite a few of these.

Namenode - the ‘master’ machine. It controls all the metadata for the cluster, e.g. which blocks
make up a file and which datanodes those blocks are stored on.

Secondary Namenode - this is NOT a backup namenode, but a separate service that keeps a
copy of both the edit logs and the filesystem image, merging them periodically to keep their size
reasonable. (This is soon being deprecated in favor of the backup node and the checkpoint node,
but the functionality remains similar, if not the same.)
Data can be accessed using either the Java API, or the Hadoop command line client. Many
operations are similar to their Unix counterparts. Check out the documentation page for the full
list, but here are some simple examples:

# list files in the root directory
hadoop fs -ls /

# list files in my home directory
hadoop fs -ls ./

# cat a file (decompressing if needed)
hadoop fs -text ./file.txt.gz

# upload and retrieve a file
hadoop fs -put ./localfile.txt /home/matthew/remotefile.txt
hadoop fs -get /home/matthew/remotefile.txt ./local/file/path/file.txt
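
The same operations can also be performed from Java. The sketch below is a minimal example against the standard HDFS FileSystem API; the file paths simply reuse the ones from the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // list files in the root directory (like: hadoop fs -ls /)
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // upload and retrieve a file (like: hadoop fs -put / -get)
        fs.copyFromLocalFile(new Path("./localfile.txt"),
                             new Path("/home/matthew/remotefile.txt"));
        fs.copyToLocalFile(new Path("/home/matthew/remotefile.txt"),
                           new Path("./local/file/path/file.txt"));

        // set the replication factor of a file to the usual value of 3
        fs.setReplication(new Path("/home/matthew/remotefile.txt"), (short) 3);

        fs.close();
    }
}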

Note that HDFS is optimized differently than a regular file system. It is designed for non-
realtime applications demanding high throughput instead of online applications demanding low
latency. For example, files cannot be modified once written, and the latency of reads/writes is
really bad by filesystem standards. On the flip side, throughput scales fairly linearly with the
number of datanodes in a cluster, so it can handle workloads no single machine would ever be
able to.

HDFS also has a bunch of unique features that make it ideal for distributed systems:

Failure tolerant - data can be duplicated across multiple datanodes to protect against machine
failures. The industry standard seems to be a replication factor of 3 (everything is stored on three
machines).
Scalability - data transfers happen directly with the datanodes so your read/write capacity scales
fairly well with the number of datanodes

Space - need more disk space? Just add more datanodes and re-balance

Industry standard - Lots of other distributed applications build on top of HDFS (HBase, Map-
Reduce)

Pairs well with MapReduce - As we shall learn

HDFS Resources

For more information about the design of HDFS, you should read through the Apache
documentation page. In particular, the streaming and data access section has some really simple
and informative diagrams on how data reads/writes actually happen.

MapReduce

The second fundamental part of Hadoop is the MapReduce layer. This is made up of two
sub components:

An API for writing MapReduce workflows in Java.

A set of services for managing the execution of these workflows.

THE MAP AND REDUCE APIS

The basic premise is this:

Map tasks perform a transformation.

Reduce tasks perform an aggregation.

In Scala, a simplified version of a MapReduce job might look like this:

def map(lineNumber: Long, sentence: String) = {
  val words = sentence.split(" ")
  words.foreach { word =>
    output(word, 1)
  }
}

def reduce(word: String, counts: Iterable[Long]) = {
  var total = 0L
  counts.foreach { count =>
    total += count
  }
  output(word, total)
}

Notice that the output of a map or reduce task is always a KEY, VALUE pair. You
always output exactly one key and one value. The input to a reduce is KEY,
ITERABLE[VALUE]. Reduce is called exactly once for each key output by the map phase, and
the ITERABLE[VALUE] is the set of all values output by the map phase for that key.

So if you had map tasks that output

map1: key: foo, value: 1

map2: key: foo, value: 32

Your reducer would receive:

key: foo, values: [1, 32]

Counterintuitively, one of the most important parts of a MapReduce job is what
happens between map and reduce: there are three other stages, Partitioning, Sorting, and Grouping. In
the default configuration, the goal of these intermediate steps is to ensure this behavior: that the
values for each key are grouped together, ready for the reduce() function. APIs are also provided
if you want to tweak how these stages work (like if you want to perform a secondary sort).
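
A custom Partitioner is one of those hooks: it decides which reducer each intermediate (key, value) pair is sent to during the shuffle. The sketch below is a hypothetical example (the key types and the first-letter rule are invented for illustration), not code from this project.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes intermediate pairs to reducers during the shuffle stage.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Send keys starting with the same letter to the same reducer.
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : Character.toLowerCase(k.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}
// Registered on the job with: job.setPartitionerClass(FirstLetterPartitioner.class);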

Here’s a diagram of the full workflow to try and demonstrate how these pieces all fit
together, but really at this stage it’s more important to understand how map and reduce interact
rather than understanding all the specifics of how that is implemented.

What’s really powerful about this API is that there is no dependency between any two
instances of the same kind of task. To do its job, a map() task does not need to know about the
other map tasks, and similarly a single reduce() task has all the context it needs to aggregate for
any particular key; it does not share any state with other reduce tasks.

Taken as a whole, this design means that the stages of the pipeline can be easily
distributed to an arbitrary number of machines. Workflows requiring massive datasets can be
easily distributed across hundreds of machines because there are no inherent dependencies
between the tasks requiring them to be on the same machine.

MapReduce API Resources

If you want to learn more about MapReduce (generally, and within Hadoop) I
recommend you read the Google MapReduce paper, the Apache MapReduce documentation, or
maybe even the hadoop book. Performing a web search for MapReduce tutorials also offers a lot
of useful information.

To make things more interesting, many projects have been built on top of the MapReduce
API to ease the development of MapReduce workflows. For example Hive lets you write SQL to
query data on HDFS instead of Java. There are many more examples, so if you’re interested in
learning more about these frameworks, I’ve written a separate article about the most common
ones.

THE HADOOP SERVICES FOR EXECUTING MAPREDUCE JOBS

Hadoop MapReduce comes with two primary services for scheduling and running
MapReduce jobs. They are the Job Tracker (JT) and the Task Tracker (TT). Broadly speaking
the JT is the master and is in charge of allocating tasks to task trackers and scheduling these tasks
globally. A TT is in charge of running the Map and Reduce tasks themselves.

When running, each TT registers itself with the JT and reports the number of ‘map’ and
‘reduce’ slots it has available, the JT keeps a central registry of these across all TTs and allocates
them to jobs as required. When a task is completed, the TT re-registers that slot with the JT and
the process repeats.

Many things can go wrong in a big distributed system, so these services have some clever tricks
to ensure that your job finishes successfully:

Automatic retries - if a task fails, it is retried N times (usually 3) on different task trackers.

Data locality optimizations - if you co-locate a TT with a HDFS Datanode (which you should)
it will take advantage of data locality to make reading the data faster

Blacklisting a bad TT - if the JT detects that a TT has too many failed tasks, it will blacklist it.
No tasks will then be scheduled on this task tracker.

Speculative Execution - the JT can schedule the same task to run on several machines at the
same time, just in case some machines are slower than others. When one version finishes, the
others are killed.
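
These behaviours can be tuned per job through the standard configuration properties. A minimal sketch is shown below using the Hadoop 2.x property names; the exact names differ between Hadoop versions (the older JobTracker-era names are different), so treat them as indicative rather than definitive.

import org.apache.hadoop.conf.Configuration;

public class JobTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 3);           // automatic retries per map task
        conf.setInt("mapreduce.reduce.maxattempts", 3);        // automatic retries per reduce task
        conf.setBoolean("mapreduce.map.speculative", true);    // speculative execution of map tasks
        conf.setBoolean("mapreduce.reduce.speculative", true); // speculative execution of reduce tasks
        return conf;
    }
}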
Here’s a simple diagram of a typical deployment with TTs deployed alongside datanodes.

MapReduce Service Resources

For more reading on the JobTracker and TaskTracker check out Wikipedia or the Hadoop
book. I find the apache documentation pretty confusing when just trying to understand these
things at a high level, so again doing a web-search can be pretty useful.

Wrap Up

I hope this introduction to Hadoop was useful. There is a lot of information on-line, but I
didn’t feel like anything described Hadoop at a high-level for beginners.

The Hadoop project is a good deal more complex and deep than I have represented and is
changing rapidly. For example, an initiative called MapReduce 2.0 provides a more general
purpose job scheduling and resource management layer called YARN, and there is an ever
growing range of non-MapReduce applications that run on top of HDFS, such as Cloudera
Impala.

CONSTRAINS:

This section considers big data processing constraints on a low-power Hadoop cluster. Big Data
processing with Hadoop has been emerging recently, both on the computing cloud and in
enterprise deployments. However, widespread security exploits may hurt the reputation of public
clouds. If Hadoop on the cloud is not an option, an organization has to build its own Hadoop
cluster. But having a data center is not worthwhile for a small organization, in terms of both
building and operating costs. Another viable solution is to build a cluster with low-cost ARM
system-on-chip boards.

BIGDATA APPLICATIONS:

Big Data is slowly becoming ubiquitous. Every arena of business, health, or general living
standards can now implement big data analytics. To put it simply, Big Data is a field which can be
used in any zone whatsoever, given that this large quantity of data can be harnessed to one's
advantage. The major application areas of Big Data are listed below.

The Third Eye- Data Visualization:

Organizations worldwide are slowly but steadily recognizing the importance of big
data analytics. From predicting customer purchasing behavior patterns and influencing customers
to make purchases, to detecting fraud and misuse (which until very recently was an
intractable task for most companies), big data analytics is a one-stop solution. Business
experts should have the opportunity to question and interpret data according to their business
requirements, irrespective of the complexity and volume of the data. In order to achieve this,
data scientists need to efficiently visualize and present this data in a comprehensible
manner. Giants like Google, Facebook, Twitter, eBay, and Wal-Mart have adopted data visualization
to ease the complexity of handling data. Data visualization has shown immense positive outcomes in
such business organizations. By implementing data analytics and data visualization, enterprises can
finally begin to tap into the immense potential that Big Data possesses and ensure a greater return
on investment and business stability.

Integration- An exigency of the 21st century:

Integrating digital capabilities into the decision-making of an organization is transforming
enterprises. By transforming their processes, such companies are developing the agility, flexibility,
and precision that enable new growth. Gartner described the confluence of mobile devices, social
networks, cloud services, and big data analytics as the nexus of forces. Using social and mobile
technologies to alter the way people connect and interact with organizations, and
incorporating big data analytics in this process, is proving to be a boon for the organizations
implementing it. Using this concept, enterprises are finding ways to leverage the data better,
either to increase revenues or to cut costs, even if most of it is still focused on customer-centric
outcomes. While such customer-centric objectives may still be the primary concern of most companies,
a gradual shift toward integrating big data technologies into background operations and internal
processes is under way.

Big Data in Healthcare:

Healthcare is one of those arenas in which Big Data ought to have the maximum social
impact. Right from the diagnosis of potential health hazards in an individual to complex medical
research, big data is present in all aspects of it. Devices such as the Fitbit, Jawbone and the
Samsung Gear Fit allow the user to track and upload data. Soon enough such data will be
compiled and made available to doctors, which will aid them in the diagnosis.

MODULES
1. Stream processing frameworks

In wide-area data analytics, data is generated in a geo-distributed fashion and some
new constraints, such as privacy concerns, need to be considered. Stream processing handles one
data element, or a small chunk of the stream, at a time, and the data are processed
immediately upon arrival. For stream processing, computations are relatively simple and
independent, and it benefits from low latency, typically seconds. The latency can be defined as
the delay between receiving a request and generating the response. We build datacenters
geographically with the purpose of achieving low latencies for local users. Nevertheless, as data
volumes keep increasing at a tremendous rate, it is still time consuming to transfer such a
substantial amount of data across datacenters. Many cloud services have very stringent latency
requirements; even a delay of one second can make a great difference.
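
As a toy illustration of the element-at-a-time model (plain Java with an in-memory queue, deliberately not tied to any particular streaming framework), each record is handled as soon as it arrives instead of being accumulated into a batch:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TinyStreamProcessor {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> stream = new LinkedBlockingQueue<>();

        // Consumer: processes each element immediately upon arrival (low latency).
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = stream.take();   // blocks until the next element arrives
                    System.out.println("processed: " + event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: events generated, e.g., at a local datacenter.
        for (int i = 0; i < 5; i++) {
            stream.put("click-" + i);
            Thread.sleep(100);
        }
        Thread.sleep(200);   // give the consumer time to drain before the JVM exits
    }
}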

2. Cost-Aware Big Data Processing

Distributed execution is a strategy widely used in cost-aware big data processing across
datacenters. The strategy is to push computations down to local datacenters and then aggregate the
intermediate results for further processing. We use a motivating example to show the high-level
idea of this strategy. A social network provider wants to obtain the hot search words for every
ten-minute window. Click logs and search logs are the two kinds of input data sources: click logs
store web server logs of user activities, and search logs are the records of user requests for
information. The base data is born distributed across datacenters, and what we want to do is to
devise an execution strategy that minimizes data traffic across different datacenters. If we use a
centralized execution, as in Fig. 4, we observe that data traffic across datacenters is 600 GB per
day. However, if we use the distributed execution strategy depicted, the data size becomes much
smaller after pre-processing in the local datacenters, and data traffic across datacenters is only
5 GB per day.
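
A back-of-the-envelope sketch of this comparison is given below; the per-datacenter figures are invented purely so that they add up to the 600 GB versus 5 GB outcome of the example above.

public class TrafficEstimate {
    public static void main(String[] args) {
        // Raw click/search logs born in three local datacenters (GB per day), illustrative values.
        double[] rawLogsGb = {200.0, 250.0, 150.0};
        // Size of each datacenter's pre-aggregated intermediate result (GB per day), illustrative.
        double[] partialResultsGb = {2.0, 2.0, 1.0};

        double centralized = 0, distributed = 0;
        for (int i = 0; i < rawLogsGb.length; i++) {
            centralized += rawLogsGb[i];        // ship all raw logs to a single datacenter
            distributed += partialResultsGb[i]; // ship only the local aggregates for final reduction
        }
        System.out.printf("centralized: %.0f GB/day, distributed: %.0f GB/day%n",
                          centralized, distributed);
    }
}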

3. Batch processing frameworks

Hadoop is a batch processing framework, and the data to be processed are stored in
HDFS, a powerful tool designed to manage large datasets with high fault-tolerance. MapReduce,
the heart of Hadoop, is a programming model that allows processing a substantial amount of data
in parallel. The MapReduce model has three major processing phases: Map, Shuffle, and Reduce.
A traditional relational database organizes data into rows and columns and stores the data in
tables; MapReduce uses a different approach, based on key/value pairs.
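
Since the report's stated coding language for the Mapper and Reducer is Java, a minimal word-count sketch of this model is given below (standard Hadoop MapReduce API; the driver/job wiring is omitted for brevity). The Map phase emits (word, 1) pairs, the Shuffle groups them by key, and the Reduce phase sums the counts.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: transform each input line into (word, 1) key/value pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emitted to the shuffle, grouped by word
                }
            }
        }
    }

    // Reduce phase: aggregate all counts shuffled to the same key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(word, new IntWritable(total));
        }
    }
}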

SCREENSHOTS
CONCLUSION

Thus, in this paper, a methodical framework for effective data movement, resource
provisioning, and reducer selection with the goal of cost minimization is developed. We balance
five types of cost: bandwidth cost, storage cost, computing cost, migration cost, and latency cost,
between the two MapReduce phases across datacenters. This complex cost optimization problem
is formulated as a joint stochastic integer nonlinear optimization problem that minimizes the five
cost factors simultaneously. By employing the Lyapunov technique, we transform the original
problem into three independent subproblems that can be solved by an efficient online algorithm,
MiniBDP, designed to minimize the long-term time-averaged operation cost. We conduct
theoretical analysis to demonstrate the effectiveness of MiniBDP in terms of cost optimality and
worst-case delay. We perform an experimental evaluation using a real-world trace dataset to
validate the theoretical results and the superiority of MiniBDP by comparing it with existing
typical approaches and offline methods. The proposed approach is expected to have widespread
application prospects in globally-serving companies, since analyzing geographically dispersed
datasets is an efficient way to support their marketing decisions. As the subproblems in the
MiniBDP algorithm have analytical or efficient solutions that allow the algorithm to run in an
online manner, the proposed approach can be easily implemented in a real system to reduce the
operation cost.

REFERENCES
[1] “Square kilometre array,” http://www.skatelescope.org/.

[2] A. Vulimiri, C. Curino, B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese, “Global
analytics in the face of bandwidth and regulatory constraints,” in Proceedings of the USENIX
NSDI’15, 2015.

[3] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster
computing with working sets,” in Proceedings of the USENIX HotCloud’10, 2010.

[5] E. E. Schadt, M. D. Linderman, J. Sorenson, L. Lee, and G. P. Nolan, “Computational
solutions to large-scale data management and analysis,” Nature Reviews Genetics, vol. 11, no. 9,
pp. 647–657, 2010.

[6] M. Cardosa, C. Wang, A. Nangia et al., “Exploring mapreduce efficiency with highly-
distributed data,” in Proceedings of the second international workshop on MapReduce and its
applications, 2011.

[7] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau, “Moving big data to the cloud:
An online cost-minimizing approach,” IEEE Journal on Selected Areas in Communications, vol.
31, pp. 2710–2721, 2013.

[8] W. Yang, X. Liu, L. Zhang, and L. T. Yang, “Big data real-time processing based on storm,”
in Proceedings of the IEEE TrustCom’13, 2013.
[9] Y. Zhang, S. Chen, Q. Wang, and G. Yu, “i2mapreduce: Incremental mapreduce for mining
evolving big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 1906–
1919, 2015.

[10] D. Lee, J. S. Kim, and S. Maeng, “Large-scale incremental processing with mapreduce,”
Future Generation Computer Systems, vol. 36, no. 7, pp. 66–79, 2014.

[11] B. Heintz, A. Chandra, R. K. Sitaraman, and J. Weissman, “End-to-end optimization for geo-
distributed mapreduce,” IEEE Transactions on Cloud Computing, 2014.

[12] C. Jayalath, J. Stephen, and P. Eugster, “From the cloud to the atmosphere: Running
mapreduce across data centers,” IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87,
2014.

[13] P. Li, S. Guo, S. Yu, and W. Zhuang, “Cross-cloud mapreduce for big data,” IEEE
Transactions on Cloud Computing, 2015, DOI: 10.1109/TCC.2015.2474385.

[14] A. Sfrent and F. Pop, “Asymptotic scheduling for many task computing in big data
platforms,” Information Sciences, vol. 319, pp. 71–91, 2015.

[15] L. Zhang, Z. Li, C. Wu, and M. Chen, “Online algorithms for uploading deferrable big data
to the cloud,” in Proceedings of the IEEE INFOCOM, 2014, pp. 2022–2030.

[16] Q. Zhang, L. Liu, A. Singh et al., “Improving hadoop service provisioning in a
geographically distributed cloud,” in Proceedings of IEEE Cloud’14, 2014.

[17] A. Vulimiri, C. Curino, P. B. Godfrey, K. Karanasos, and G. Varghese, “Wanalytics:
Analytics for a geo-distributed data-intensive world,” in Proceedings of the CIDR’15, 2015.

[18] K. Kloudas, M. Mamede, N. Preguica, and R. Rodrigues, “Pixida: Optimizing data parallel
jobs in wide-area data analytics,” Proceedings of the VLDB Endowment, vol. 9, no. 2, pp. 72–83,
2015.

[19] Q. Pu, G. Ananthanarayanan, P. Bodik, S. Kandula, A. Akella, P. Bahl, and I. Stoica, “Low
latency geo-distributed data analytics,” in Proceedings of the ACM SIGCOMM’15, 2015.

[20] “Facebook’s prism project,” http://www.wired.com/wiredenterprise/2012/08/facebook-prism/.

[21] J. C. Corbett, J. Dean, M. Epstein et al., “Spanner: Google’s globally distributed database,”
in Proceedings of the OSDI’12, 2012.

[22] “Connecting geographically dispersed datacenters,” HP, 2015.


[23] “Interconnecting geographically dispersed data centers using VPLS - design and system
assurance guide,” Cisco Systems, Inc., 2009.
[24] L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and
scheduling policies for maximum throughput in multihop radio networks,” IEEE Transactions on
Automatic Control, vol. 37, no. 12, pp. 1936–1948, 1992.

[25] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely, “Dynamic resource allocation and
power management in virtualized data centers,” in Proceedings of the IEEE NOMS, 2010, pp.
479–486.

[26] F. Liu, Z. Zhou, H. Jin, B. Li, B. Li, and H. Jiang, “On arbitrating the power-performance
tradeoff in saas clouds,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 10,
pp. 2648–2658, 2014.

[27] Y. Yao, L. Huang, A. Sharma, L. Golubchik, and M. Neely, “Power cost reduction in
distributed data centers: A two-time-scale approach for delay tolerant workloads,” IEEE
Transactions on Parallel and Distributed Systems, vol. 25, no. 1, pp. 200–211, 2014.

[28] D. Wu, Z. Xue, and J. He, “icloudaccess: Cost-effective streaming of video games from the
cloud with low latency,” IEEE Transactions on Circuits and Systems for Video Technology, vol.
24, no. 8, pp. 1405– 1416, 2014.

[29] W. Xiao, W. Bao, X. Zhu, C. Wang, L. Chen, and L. T. Yang, “Dynamic request redirection
and resource provisioning for cloud-based video services under heterogeneous environment,”
IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 7, pp. 1954–1967, 2016.

[30] S. Rao, R. Ramakrishnan, A. Silberstein et al., “Sailfish: a framework for large scale data
processing,” in Proceedings of the Third ACM Symposium on Cloud Computing, 2012.

[31] M. Neely, Stochastic Network Optimization with Application to Communication and
Queueing Systems. Morgan and Claypool, 2010.

[32] M. J. Neely, “Opportunistic scheduling with worst case delay guarantees in single and
multi-hop networks,” in Proceedings of the IEEE INFOCOM, 2011, pp. 1728–1736.

[33] M. Arlitt and T. Jin, “A workload characterization study of the 1998 world cup web site,”
IEEE Network, vol. 14, no. 3, pp. 30–37, 2000.
