
Data Structures of Big Data: How They Scale


by Dibyendu Bhattacharya and Manidipa Mitra
EMC. (c) 2014. Copying Prohibited.

  

Reprinted for Devjyoti Mitra (Devjyoti.Mitra@rms.com), RMS.
Reprinted with permission as a subscription benefit of Books24x7, http://www.books24x7.com/

All rights reserved. Reproduction and/or distribution in whole or in part in electronic, paper, or other forms without written permission is prohibited.

Chapter: Hadoop Optimization for Local Storage Performance


Hadoop has become synonymous with big data processing technologies. The most sought-after technology for distributed parallel computing, Hadoop addresses a specific part of the big data spectrum: it is a highly scalable, fault-tolerant big data processing system designed to run on commodity hardware. Hadoop consists of two major components:
1. Hadoop Distributed File System (HDFS), a fault-tolerant, highly consistent distributed file system
2. The MapReduce engine, a parallel programming paradigm built on top of the distributed file system
The Hadoop framework is highly optimized for batch processing of huge amounts of data, where large files are stored across multiple machines. HDFS is implemented as a user-level file system in Java that exploits the native file system on each node, such as ext3 or NTFS, to store data. Files in HDFS are divided into large blocks, typically 64MB, and each block is stored as a separate file in the local file system. HDFS is implemented by two services, NameNode and DataNode. NameNode is the master daemon that manages the file system metadata, while DataNode, the slave daemon, actually stores the data blocks. The MapReduce engine is likewise implemented by two services, JobTracker and TaskTracker: JobTracker is the master daemon that schedules and monitors distributed jobs, and TaskTracker is the slave daemon that performs the individual job tasks.
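The block-splitting and placement behavior described above can be sketched in a few lines. This is a minimal illustration only: the node names are hypothetical, and real HDFS placement also considers replication and rack topology, which this sketch omits.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the typical 64MB HDFS block size from the text

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= blocks[-1]
    return blocks

def place_blocks(blocks, datanodes):
    """Round-robin placement of block indices onto DataNodes (real HDFS
    also applies replication and rack awareness, omitted here)."""
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

# A 200 MB file becomes three full 64 MB blocks plus an 8 MB tail block,
# spread across three (hypothetical) DataNodes.
blocks = split_into_blocks(200 * 1024 * 1024)
placement = place_blocks(blocks, ["dn1", "dn2", "dn3"])
```

In the real system the NameNode records this block-to-DataNode mapping as metadata, while the DataNodes store the block contents.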
Hadoop MapReduce applications use storage in a manner that differs from general-purpose computing. First, the data files accessed are
large, typically tens to hundreds of gigabytes in size. Second, these files are manipulated with streaming access patterns typical of batch-
processing workloads. When reading files, large data segments are retrieved per operation, with successive requests from the same client
iterating through a file region sequentially. Similarly, files are also written in a sequential manner. This emphasis on streaming workloads is
evident in the design of HDFS. A simple coherence model (write-once, read-many) is used that does not allow data to be modified once written (recent versions of Hadoop do support file appends, but the design is complex and beyond the scope of this article). This is well suited to the streaming access pattern of the target applications, and it improves cluster scaling by simplifying synchronization requirements.
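The write-once, read-many coherence model can be captured in a toy sketch: appends are allowed while a file is open, but once closed the contents are immutable. This is an illustration of the model, not HDFS's actual client API.

```python
class WriteOnceFile:
    """Toy model of HDFS's write-once, read-many coherence: data can be
    appended while the file is open for writing, but once closed the
    contents are immutable and may only be read."""
    def __init__(self):
        self._chunks = []
        self._closed = False

    def write(self, data: bytes):
        if self._closed:
            # Mirrors the coherence rule: written data cannot be modified.
            raise IOError("file is closed: contents are immutable")
        self._chunks.append(data)

    def close(self):
        self._closed = True

    def read(self) -> bytes:
        return b"".join(self._chunks)

f = WriteOnceFile()
f.write(b"map ")
f.write(b"reduce")
f.close()
```

Because no reader ever observes in-place modification, readers need no locks against writers, which is exactly the synchronization simplification the text describes.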
The Hadoop MapReduce programming model follows a shared-nothing architecture: individual Map and Reduce tasks do not share any data structures, which avoids synchronization and locking issues. Each file in HDFS is divided into large blocks for storage and access, typically 64MB in size. Portions of the file can be stored on different cluster nodes, balancing storage resources and demand. Manipulating data at this granularity is efficient because streaming-style applications are likely to read or write an entire block before moving on to the next.[1]

HDFS is a user-level file system running on top of an operating-system-level file system (e.g. ext3 or ext4 on Linux). Any read or write to HDFS uses the underlying operating system's support for reading and writing the raw disk. Raw disk performance is much better for linear reads and writes, while random read and write performance is very poor due to seek-time overhead. Linear reads and writes are the most predictable of all usage patterns, and they are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind caching techniques that pre-fetch data in large block multiples and group smaller logical writes into large physical writes. We will discuss these two caching techniques in detail when we explain how Kafka (the real-time distributed messaging platform) uses operating system caching to scale. But does Hadoop really need the operating system's caching support? The answer is no. The OS cache is pure overhead for Hadoop, because the sequential access pattern of MapReduce applications has minimal locality for a cache to exploit. Hadoop would perform better if it could bypass the OS cache completely, but this is difficult in practice: HDFS is written in Java, and standard Java I/O provides no way to bypass OS caching.
Hadoop performance suffers most when data is not accessed sequentially, which can happen because of a poor disk scheduling algorithm or because data becomes fragmented on disk during writes.

Various studies of raw disk performance show that reads and writes approach peak bandwidth when the sequential run length (the amount of data scanned sequentially before the next random seek) reaches about 32MB. Keeping the HDFS block size at 64MB is therefore very reasonable.
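A back-of-the-envelope model shows why run length matters. Assume (illustratively; the figures below are assumptions, not measurements from the cited study) a disk with 100 MB/s peak transfer rate and a 10 ms seek before each sequential run:

```python
def effective_bandwidth(run_mb, peak_mb_s=100.0, seek_s=0.010):
    """Average throughput when each sequential run of run_mb megabytes
    is preceded by one random seek costing seek_s seconds."""
    return run_mb / (seek_s + run_mb / peak_mb_s)

for run in (1, 8, 32, 64):
    bw = effective_bandwidth(run)
    # With a 100 MB/s peak, bw in MB/s equals the percentage of peak.
    print(f"{run:3d} MB run -> {bw:5.1f} MB/s")
```

Under these assumptions a 1 MB run wastes half the bandwidth on seeking, while a 32 MB run already delivers roughly 97% of peak, so growing blocks beyond 64MB yields diminishing returns.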


However, optimized I/O bandwidth during reads and writes is not guaranteed. Data may become fragmented on disk, hurting read performance, or the operating system's scheduling algorithm can reduce write bandwidth, which in turn causes fragmentation. Let us examine these two points, disk scheduling and disk fragmentation, in detail and see how Hadoop has addressed them.
HDFS 0.20.x performance degrades whenever the disk is shared between concurrent multiple writers or readers. Excessive disk seeks occur
that are counter-productive to the goal of maximizing overall disk bandwidth. This is a fundamental problem that affects HDFS running on all
platforms. Existing Operating System I/O schedulers are designed for general-purpose workloads and attempt to share resources fairly
between competing processes. In such workloads, storage latency is of equal importance to storage bandwidth; thus, fine-grained fairness is
provided at a small granularity (a few hundred kilobytes or less). In contrast, MapReduce applications are almost entirely latency insensitive,
and thus should be scheduled to maximize disk bandwidth by handling requests at a large granularity (dozens of megabytes or more). Testing found that in Hadoop 0.20.x, aggregate bandwidth drops drastically when moving from one writer to two concurrent writers, and drops further as more writers are added.
This degradation occurs because the number of seeks grows with the number of writers: the I/O scheduler forces the disk head to move between distinct data streams, and these extra seeks dramatically shorten the average sequential run length.
In addition to poor I/O scheduling, HDFS also suffers from disk fragmentation when sharing a disk between multiple writers. The maximum
possible file contiguity—the size of an HDFS block—is not preserved by the general-purpose file system when making disk allocation
decisions.
Similar degradation occurs as the number of readers increases: disk scheduling across multiple read operations lowers overall read bandwidth. Moreover, as fragmentation accumulates (a side effect of disk scheduling during writes), the sequential run length shrinks, the number of disk seeks grows, and read bandwidth falls further.
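The effect of fair scheduling on run length can be made concrete with a small simulation. This is a sketch under simplified assumptions (a strict round-robin scheduler with a fixed per-turn grant; every switch between streams counts as one seek), not a model of any particular OS scheduler.

```python
def simulate_fair_scheduler(num_writers, per_writer_mb, granularity_mb):
    """Round-robin the disk head across num_writers streams, granting
    granularity_mb per turn. Returns (seek_count, average_run_length_mb),
    counting a seek whenever the head switches to a different stream."""
    remaining = [per_writer_mb] * num_writers
    seeks = 0
    runs = []           # lengths of contiguous runs written on disk
    current = None
    while any(r > 0 for r in remaining):
        for w in range(num_writers):
            if remaining[w] <= 0:
                continue
            grant = min(granularity_mb, remaining[w])
            remaining[w] -= grant
            if w != current:      # head moved to a different stream
                seeks += 1
                runs.append(grant)
                current = w
            else:
                runs[-1] += grant  # contiguous with the previous grant
    return seeks, sum(runs) / len(runs)

# One writer streams 256 MB as a single long run; four writers scheduled
# at a fine 1 MB granularity shred the same 256 MB into 1 MB runs.
print(simulate_fair_scheduler(1, 256, 1))
print(simulate_fair_scheduler(4, 64, 1))
```

Combined with the bandwidth model earlier in this chapter, 1 MB average runs mean the disk spends a large share of its time seeking rather than transferring, which matches the measured drop in aggregate bandwidth.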

The diagram[2] below shows the performance impact on different file systems as the number of readers and writers increases, as well as the impact of fragmentation in Hadoop 0.20.x.

In this diagram, concurrent writes on Linux exhibited better performance than on FreeBSD. The ext4 file system showed 8% degradation moving from 1 to 4 concurrent writers, while the XFS file system showed none. In contrast, HDFS on Linux performed worse than on FreeBSD for concurrent reads: ext4 degraded by 42% moving from 1 to 4 concurrent readers, and XFS by 43%. Finally, fragmentation was less severe on Linux, with ext4 degrading by 8% and XFS by 6% when a single reader accessed files created by 1 to 4 concurrent writers.
How does the latest version of Hadoop solve these scheduling and fragmentation problems? The key is to make HDFS smarter, presenting requests to the operating system in the order HDFS wants them processed. To elaborate on this point: the fundamental problem arises because the OS I/O scheduler sees multiple writer and reader processes issuing calls, and applies its own scheduling algorithm across those processes.


In the diagram above, four clients issue write/read calls to the operating system. Earlier versions of Hadoop spawn a separate thread for each client, so the operating system sees four processes competing for the disk and applies its scheduling policy (e.g. round robin or time sharing) across them, which leads to poor I/O bandwidth and disk fragmentation.
The diagram below shows how the latest version of Hadoop solves this problem.

In the diagram above, four clients again try to access the disk, but now HDFS buffers the requests and schedules them to disk at a specified granularity (say, 64MB) using a single thread per disk. Since only one thread per disk reads or writes the buffers, from the operating system's point of view there is just one process accessing the disk. The OS therefore incurs no expensive scheduling overhead, fragmentation drops, and I/O bandwidth rises.
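The buffering scheme just described can be sketched as follows. This is an illustration of the idea, not HDFS's actual implementation: requests accumulate per client, and a single drain path issues only full-granule operations, so the OS sees one stream of large requests. Client names and the small granule value are hypothetical.

```python
class SingleThreadDiskScheduler:
    """Sketch of the newer HDFS approach: per-client requests accumulate
    in buffers, and one logical writer drains a buffer only when a full
    granule (e.g. 64 MB) is ready, so the OS sees a single stream of
    large sequential requests instead of many interleaved small ones."""
    def __init__(self, granule):
        self.granule = granule
        self.buffers = {}     # client -> pending byte count
        self.disk_ops = []    # (client, size) requests issued to the OS

    def submit(self, client, nbytes):
        self.buffers[client] = self.buffers.get(client, 0) + nbytes
        while self.buffers[client] >= self.granule:
            # One large write per granule, issued by the single drain path.
            self.disk_ops.append((client, self.granule))
            self.buffers[client] -= self.granule

sched = SingleThreadDiskScheduler(granule=64)
for _ in range(16):
    for client in ("c1", "c2", "c3", "c4"):
        sched.submit(client, 8)   # many small client writes
# Each client submitted 128 units in 8-unit pieces, yet the disk saw
# only two 64-unit writes per client.
```

Because each disk operation is granule-sized and the drain path is single-threaded, the OS scheduler has nothing to interleave, which is exactly why the average sequential run length stays large.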
Various other optimizations have been made in HDFS to improve performance. In this article we discussed one issue, disk fragmentation and scheduling, and also touched on the operating system caching mechanism, which is pure overhead for MapReduce-style access patterns. Various studies have found that Hadoop would perform better if it could bypass the OS cache. In some cases, if Hadoop could bypass the OS file system entirely and access the raw disk directly, it could avoid all of this OS overhead; the complexity and challenges of that approach are a different topic of conversation entirely.
Even though Hadoop cannot get the best out of the OS cache layer, another big data infrastructure, Kafka (https://kafka.apache.org/), has benefited greatly from OS caching support to scale. Kafka is an open source distributed messaging system developed at LinkedIn, which needed to scale its messaging infrastructure to process hundreds of thousands of messages per second. Because no existing traditional messaging system could meet this high-volume requirement, LinkedIn developed Kafka, which scales to very high write and read throughput by relying heavily on the OS cache mechanism. Let's explore how they made Kafka scale.


[1] http://www.cs.rice.edu/CS/Architecture/docs/phdthesis-shafer.pdf

[2] Fig 3.9, http://www.cs.rice.edu/CS/Architecture/docs/phdthesis-shafer.pdf
