
A Review Paper on Big Data and Networking

Abstract: The term Big Data describes innovative techniques and technologies to capture, store, distribute, manage, and analyze petabyte- or larger-sized datasets that arrive at high velocity and with varied structures. Big data can be structured, unstructured, or semi-structured, which renders conventional data management methods inadequate. Data is generated from many different sources and can arrive in the system at varying rates. To process these large amounts of data in an inexpensive and efficient way, parallelism is used. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. Hadoop is the core platform for structuring Big Data and solves the problem of making it useful for analytics purposes. Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

1. Introduction

Big Data and its relationship with network systems:

1. Why, What, and How of Big Data: it is all because of advances in networking.

2. Recent developments in networking and their role in Big Data (virtualization, SDN, NFV).

3. Networking needs Big Data.

Pic 1. Big Data and Network Structure (Big Data, large storage, fast computing, cloud, virtualization, networking)

Understanding MapReduce

A software framework that processes massive amounts of unstructured data by distributing it over a large number of inexpensive processors:

1. Map: takes a set of data and divides it for computation.

2. Reduce: takes the output from Map and outputs the results.

Pic 2. Simplified Data Processing on Large Clusters (input → map → shuffle → reduce → output)
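To make the two steps concrete, the following minimal word-count sketch imitates the map, shuffle, and reduce stages of Pic 2 in plain Python; the function names and the in-memory shuffle are illustrative assumptions, not Hadoop's actual API.

from collections import defaultdict

def map_phase(document):
    # Map: divide the input into (key, value) pairs for computation.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key so each reduce call sees one key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: take the mapped output and emit the final result.
    return (key, sum(values))

documents = ["big data needs networking", "networking needs big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
print(sorted(results))  # [('big', 2), ('data', 2), ('needs', 2), ('networking', 2)]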
What Is Hadoop

An open-source implementation of MapReduce. It was named by Doug Cutting at Yahoo after his son's yellow plush toy elephant.

The Hadoop File System (HDFS) requires data to be broken into blocks. Each block is stored on two or more data nodes on different racks.

Pic 3. Hadoop architecture

Name Node: manages the file system namespace and keeps track of blocks on the various Data Nodes.

Job Tracker: assigns MapReduce jobs to Task Tracker nodes that are close to the data (same rack), as sketched below.

Task Tracker: keeps the work as close to the data as possible.
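The same-rack preference can be sketched as follows, assuming a hypothetical rack map and worker list; real Hadoop schedulers weigh many more factors.

# Hypothetical illustration of data-locality scheduling: prefer a worker
# on the same rack as the data block, falling back to any free worker.
RACK_OF = {"worker1": "rackA", "worker2": "rackA", "worker3": "rackB"}

def assign_task(block_rack, free_workers):
    # Prefer workers whose rack matches the rack holding the data block.
    local = [w for w in free_workers if RACK_OF[w] == block_rack]
    return (local or free_workers)[0]

print(assign_task("rackB", ["worker1", "worker3"]))  # worker3 (same rack)
print(assign_task("rackB", ["worker1", "worker2"]))  # worker1 (fallback)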

Pic 4. Block of HDFS
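As a toy sketch of the block mechanism behind Pic 4, the snippet below splits a file into fixed-size blocks and places each block on data nodes in two different racks; the tiny block size and rack names are assumptions for illustration (HDFS defaults to much larger blocks, e.g. 128 MB).

BLOCK_SIZE = 4  # bytes, tiny for illustration; HDFS uses much larger blocks
RACKS = {"rackA": ["dn1", "dn2"], "rackB": ["dn3", "dn4"]}

def split_into_blocks(data, size=BLOCK_SIZE):
    # HDFS breaks a file into fixed-size blocks.
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(block_id):
    # Store each block on two or more data nodes on different racks.
    return [RACKS["rackA"][block_id % 2], RACKS["rackB"][block_id % 2]]

for i, block in enumerate(split_into_blocks(b"bigdatafile!")):
    print(i, block, "->", place_replicas(i))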
Hadoop can be described as an ecosystem: a framework for processing, storing, and analyzing massive amounts of distributed unstructured data. As a distributed file storage subsystem, the Hadoop Distributed File System (HDFS) was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. The picture below illustrates an overview of Hadoop's deployment in a big data analytics environment.

Pic 5. Hadoop Deployment Structure

Big Data for Networking

Today vs. tomorrow:

Today's data center              | Tomorrow's data center
Tens of tenants                  | 1K of clients
Hundreds of switches and routers | 10K of switches
Thousands of servers             | 1M of VMs
Tens of administrators           | Hundreds of administrators

The need to monitor traffic patterns and rearrange the virtual networks connecting millions of VMs in real time means that managing clouds is itself a real-time big data problem (see the sketch below).

Internet of Things → Big Data generation and analytics.
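To hint at why this is a streaming problem, here is a small sketch that aggregates per-VM traffic counters in a single pass; the sample records and the threshold are invented for the example.

from collections import Counter

# Hypothetical stream of (vm_id, bytes_sent) samples from the virtual network.
samples = [("vm1", 120), ("vm2", 30), ("vm1", 500), ("vm3", 80), ("vm1", 900)]

totals = Counter()
for vm, nbytes in samples:
    totals[vm] += nbytes
    if totals[vm] > 1000:  # assumed threshold for rearranging the virtual network
        print(f"hot VM detected: {vm} ({totals[vm]} bytes)")

print(totals.most_common(2))  # heaviest senders so far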

2. Networking in Big Data

Networking Requirements for Big Data

1. Code/Data Collocation: the data for map jobs should be at the processors that are going to map it.

2. Elastic bandwidth: to match the variability of data volume.

3. Fault/Error Handling: if a processor fails, its task needs to be assigned to another processor (CPU); see the sketch after this list.

4. Security: access control (authorized users only), privacy (encryption), and threat detection, all in real time and in a highly scalable manner.

5. Synchronization: the map jobs should be comparable so that they finish together; similarly, the reduce jobs should be comparable.
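A minimal sketch of requirement 3, assuming a hypothetical task table and heartbeat flags; a real scheduler would balance load rather than pick the first spare.

# Illustrative fault handling: if a worker stops heartbeating, its task
# is reassigned to a live worker so the job can still finish.
tasks = {"map-0": "cpu1", "map-1": "cpu2", "map-2": "cpu3"}
alive = {"cpu1": True, "cpu2": False, "cpu3": True}  # cpu2 has failed

def reassign_failed(tasks, alive):
    spares = [w for w, ok in alive.items() if ok]
    for task, worker in tasks.items():
        if not alive[worker]:
            tasks[task] = spares[0]  # naive choice for illustration only
            print(f"{task}: {worker} failed, reassigned to {spares[0]}")
    return tasks

print(reassign_failed(tasks, alive))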

Recent Developments in Networking

1. High speed: 100 Gbps Ethernet → 400 Gbps → 1000 Gbps.

2. Cheap storage access: it is easy to move big data.

3. Virtualization.

4. Software Defined Networking.

5. Network Function Virtualization.

Virtualization

Recent networking technologies and standards allow:

1. Virtualizing Computation

2. Virtualizing Storage

3. Virtualizing Rack Storage Connectivity

4. Virtualizing Data Center Storage

5. Virtualizing Metro and Global Storage

Virtualizing Computation

Initially, data centers consisted of multiple IP subnets:

Each subnet = one Ethernet network.

Ethernet addresses are globally unique and do not change.

IP addresses are locators and change every time you move.

If a VM moves inside a subnet → no change to IP address → fast.

If a VM moves from one subnet to another → its IP address changes → all connections break → slow → limited VM mobility (illustrated in the sketch below).
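The subnet rules above can be illustrated with Python's standard ipaddress module; the addresses and subnets below are made up for the example.

import ipaddress

# Invented example subnets; an IP address is a locator tied to one subnet.
old_subnet = ipaddress.ip_network("10.0.1.0/24")
new_subnet = ipaddress.ip_network("10.0.2.0/24")
vm_ip = ipaddress.ip_address("10.0.1.7")

def move_vm(ip, destination_subnet):
    # Inside the same subnet the VM keeps its IP, so connections survive.
    if ip in destination_subnet:
        return ip, "fast: no address change"
    # Crossing subnets forces a new locator, breaking open connections.
    return next(destination_subnet.hosts()), "slow: new IP, connections break"

print(move_vm(vm_ip, old_subnet))  # same subnet: IP unchanged
print(move_vm(vm_ip, new_subnet))  # different subnet: new locator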

IEEE 802.1ad-2005 Ethernet Provider Bridging (PB) and IEEE 802.1ah-2008 Provider Backbone Bridging (PBB) allow Ethernets to span long distances → global VM mobility.

Pic 6. Layer Networking Architecture
