HDFS is a user-level file system that runs on top of an operating system-level file system (e.g. ext3 or ext4 on UNIX). Every read or write to HDFS uses the underlying operating system's support for reading from and writing to raw disk. Raw disk performance is much better for linear reads and writes, while random read and write performance is very poor due to seek-time overhead. Linear reads and writes are the most predictable of all usage patterns and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind caching techniques that pre-fetch data in large block multiples and group smaller logical writes into large physical writes. We will discuss these two caching techniques in detail when we explain how Kafka (the real-time distributed messaging platform) uses operating system internal caching to scale. But does Hadoop really need the operating system's caching support? The answer is no. The operating system cache is an overhead for Hadoop, because the sequential access pattern of MapReduce applications exhibits minimal locality that a cache could exploit. Hadoop would perform better if it could bypass the OS cache completely, but this is complex to do: HDFS is written in Java, and standard Java I/O provides no way to bypass OS caching.
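The write-behind idea mentioned above, grouping many small logical writes into one large physical write, can be sketched with a small user-space model. This is purely illustrative (the class and the 64 KB threshold are assumptions for the example, not how the kernel page cache is implemented):

```python
# Sketch of the "write-behind" idea the OS page cache implements:
# many small logical writes are coalesced into one large physical write.
# This is an illustrative user-space model, not the kernel implementation.

class WriteBehindBuffer:
    def __init__(self, device_writes: list, flush_threshold: int = 64 * 1024):
        self.device_writes = device_writes  # records each physical write size
        self.flush_threshold = flush_threshold
        self.buffer = bytearray()

    def write(self, data: bytes) -> None:
        self.buffer.extend(data)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.device_writes.append(len(self.buffer))  # one large physical write
            self.buffer.clear()

physical = []
buf = WriteBehindBuffer(physical)
for _ in range(1024):        # 1024 small 256-byte logical writes
    buf.write(b"x" * 256)
buf.flush()
# 256 KB of logical writes reach the "disk" as only 4 physical 64 KB writes.
```

The disk sees a handful of large sequential transfers instead of a thousand tiny ones, which is exactly the access pattern the raw disk is fastest at.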
Hadoop performance suffers most when data is not accessed sequentially, which can happen because of a poor disk scheduling algorithm or because data becomes fragmented on disk during writes.
Studies of raw disk performance show that reads and writes approach peak bandwidth once the sequential run length (the amount of data scanned sequentially before a random seek occurs) reaches 32 MB. An HDFS block size of 64 MB is therefore very reasonable.
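The run-length argument can be checked with a back-of-envelope model. If every sequential run of L megabytes pays one seek, the fraction of peak bandwidth achieved is transfer time divided by transfer-plus-seek time. The peak bandwidth and seek time below are illustrative assumptions, not figures from the cited study:

```python
# Back-of-envelope check of the run-length argument: fraction of peak
# bandwidth achieved when one seek is paid per sequential run of L MB.
# PEAK_BW and SEEK are illustrative assumptions for a spinning disk.

PEAK_BW = 100.0    # MB/s sustained sequential transfer rate (assumed)
SEEK = 0.010       # seconds per seek, including rotational latency (assumed)

def fraction_of_peak(run_mb: float) -> float:
    transfer = run_mb / PEAK_BW
    return transfer / (transfer + SEEK)

# A 32 MB run already amortizes the seek almost completely (~97% of peak
# with these numbers), while a 100 KB run wastes most of its time seeking.
long_run = fraction_of_peak(32)
short_run = fraction_of_peak(0.1)
```

With runs of 32 MB the seek cost is nearly invisible, so a 64 MB HDFS block keeps the disk comfortably near peak bandwidth.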
Reprinted for Devjyoti.Mitra@rms.com, RMS EMC, EMC Proven Professional Knowledge Sharing (c) 2014, Copying Prohibited
Data Structures of Big Data: How They Scale
However, optimized I/O bandwidth during writes and reads is not guaranteed. Data may become fragmented on disk, hurting read performance, or the operating system's scheduling algorithm can reduce write bandwidth, which also leads to fragmentation. Let us examine disk scheduling and disk fragmentation in detail and see how Hadoop has solved these problems.
HDFS 0.20.x performance degrades whenever a disk is shared between multiple concurrent writers or readers. Excessive disk seeks occur that are counter-productive to the goal of maximizing overall disk bandwidth. This is a fundamental problem that affects HDFS on all platforms. Existing operating system I/O schedulers are designed for general-purpose workloads and attempt to share resources fairly between competing processes. In such workloads, storage latency is as important as storage bandwidth, so fine-grained fairness is provided at a small granularity (a few hundred kilobytes or less). In contrast, MapReduce applications are almost entirely latency insensitive, and should therefore be scheduled to maximize disk bandwidth by handling requests at a large granularity (dozens of megabytes or more). Testing on Hadoop 0.20.x showed that aggregate bandwidth drops drastically when moving from one writer to two concurrent writers, and drops further as more writers are added.
This degradation occurs because the number of seeks grows with the number of writers: the I/O scheduler forces the disk head to move between distinct data streams, and these seeks dramatically shorten the average sequential run length. In addition to poor I/O scheduling, HDFS also suffers from disk fragmentation when a disk is shared between multiple writers: the maximum possible file contiguity, the size of an HDFS block, is not preserved by the general-purpose file system when it makes disk allocation decisions.
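The effect of a fair scheduler interleaving streams can be modeled in a few lines. Here a fair I/O scheduler switches streams every `granularity_mb` megabytes and each switch costs a seek; all numbers are illustrative assumptions, not the article's benchmark results:

```python
# Toy model of why concurrent writers hurt aggregate bandwidth: a fair
# I/O scheduler switches streams every `granularity_mb` MB, and each
# switch costs one seek. Numbers are illustrative assumptions.

PEAK_BW = 100.0  # MB/s sustained sequential bandwidth (assumed)
SEEK = 0.010     # seconds per stream switch (assumed)

def aggregate_bandwidth(total_mb: float, granularity_mb: float,
                        writers: int) -> float:
    if writers == 1:
        switches = 0  # a single writer streams sequentially, no extra seeks
    else:
        switches = total_mb / granularity_mb  # one seek per scheduling quantum
    return total_mb / (total_mb / PEAK_BW + switches * SEEK)

one_writer = aggregate_bandwidth(256, 0.5, 1)   # full sequential bandwidth
two_writers = aggregate_bandwidth(256, 0.5, 2)  # a seek every 512 KB: collapse
```

With a fine-grained fair quantum of 512 KB, merely adding a second writer cuts the aggregate bandwidth to roughly a third in this model, matching the qualitative drop reported above.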
Similar degradation occurs as the number of readers increases. Scheduling between multiple read operations lowers the overall read bandwidth. Moreover, as fragmentation increases (caused by disk scheduling during writes), the sequential run length shrinks, the number of disk seeks grows, and read bandwidth drops.
The diagram[2] below shows the performance impact on different file systems as the number of readers and writers increases, as well as the impact of fragmentation, for Hadoop 0.20.x.
In this diagram, concurrent writes on Linux show better performance characteristics than on FreeBSD. The ext4 file system showed 8% degradation moving from 1 to 4 concurrent writers, while XFS showed no degradation. In contrast, HDFS on Linux performed worse than on FreeBSD for concurrent reads: ext4 degraded by 42% moving from 1 to 4 concurrent readers, and XFS by 43%. Finally, fragmentation was lower on Linux, as ext4 degraded by only 8% and XFS by 6% when a single reader accessed files created by 1 to 4 concurrent writers.
How does the latest version of Hadoop solve these scheduling and fragmentation problems? The key is to make HDFS smarter and present requests to the operating system in the order it wants them processed. To elaborate: the fundamental problems of disk scheduling and fragmentation arise because the OS I/O scheduler sees multiple writers and readers issuing write and read calls, and applies its scheduling algorithms across those competing processes.
In the diagram above, four clients issue write/read calls to the operating system. The earlier version of Hadoop spawns a separate thread for each client. The operating system sees four processes trying to access the disk and applies scheduling (e.g. round robin or time sharing) between them, which leads to poor I/O bandwidth and disk fragmentation.
The diagram below shows how the latest version of Hadoop solves this problem.
In the diagram above, four clients again try to access the disk, but now HDFS buffers their requests and schedules them to disk at a specified granularity (say, 64 MB) using a single thread. Since only a single thread per disk issues reads and writes, from the operating system's point of view just one process is accessing the disk. The OS therefore incurs no expensive scheduling overhead, which reduces fragmentation and yields higher I/O bandwidth.
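The buffering-and-batching idea above can be sketched as a small single-threaded scheduler. The class and method names here are hypothetical, chosen for illustration; they are not the actual HDFS internals:

```python
# Sketch of the fix described above: buffer requests from many clients
# and issue them to the disk from one thread, one large chunk at a time.
# Class and method names are hypothetical, for illustration only.
from collections import deque

GRANULARITY = 64 * 1024 * 1024  # 64 MB scheduling quantum

class SingleThreadedDiskScheduler:
    def __init__(self):
        self.queue = deque()   # buffered (client_id, num_bytes) requests
        self.issued = []       # order in which chunks reach the disk

    def submit(self, client_id: str, num_bytes: int) -> None:
        """Clients enqueue work; nothing touches the disk yet."""
        self.queue.append((client_id, num_bytes))

    def run(self) -> None:
        """One thread drains the queue, issuing each client's data
        contiguously in GRANULARITY-sized chunks before moving on."""
        while self.queue:
            client_id, remaining = self.queue.popleft()
            while remaining > 0:
                chunk = min(remaining, GRANULARITY)
                self.issued.append((client_id, chunk))
                remaining -= chunk

sched = SingleThreadedDiskScheduler()
sched.submit("client-A", 128 * 1024 * 1024)
sched.submit("client-B", 64 * 1024 * 1024)
sched.run()
# client-A's two 64 MB chunks are issued back-to-back, then client-B's one.
```

Because the disk only ever sees one stream of large contiguous requests, the OS scheduler has nothing to interleave, which is precisely why the seeks and fragmentation of the multi-threaded design disappear.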
Various other optimizations have been made in HDFS to improve its performance. In this article, we discuss one issue around disk fragmentation and scheduling. We also touch upon the operating system caching mechanism, which is an overhead for MapReduce-style access patterns. Various studies have found that Hadoop would perform better if it could bypass the OS cache. In some cases, if Hadoop could bypass the OS file system entirely and access the raw disk directly, it could avoid all OS overhead. The complexity and challenges of such an approach are a different topic of conversation entirely.
Even though Hadoop cannot get the best out of the OS cache layer, another Big Data infrastructure, Kafka (https://kafka.apache.org/), has benefited greatly from OS caching support to scale. Kafka is an open source distributed messaging system developed at LinkedIn, which needed to scale its messaging infrastructure to process hundreds of thousands of messages per second. Because no existing traditional messaging system could meet this high-volume requirement, LinkedIn developed Kafka, which scales to very high write and read throughput. Kafka is designed to rely heavily on the OS cache mechanism to scale. Let's explore how they made Kafka scale.
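The core access pattern Kafka relies on can be previewed in miniature: every message is appended sequentially to a log file, so writes are absorbed by the OS write-behind cache and trailing consumer reads are served from the page cache. This is a heavily simplified sketch (the file layout and length-prefixed framing below are assumptions for the example, not Kafka's actual on-disk format):

```python
# Minimal sketch of the log-append pattern Kafka relies on: messages are
# only ever appended sequentially, the access pattern the OS page cache
# handles best. Simplified framing; not Kafka's real on-disk format.
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "topic-0.log")

def append(message: bytes) -> None:
    # Appends are always sequential, so the OS write-behind cache can
    # batch them into large physical writes.
    with open(log_path, "ab") as f:
        f.write(len(message).to_bytes(4, "big") + message)

def read_all() -> list:
    # Sequential scan from the start; recently written data is typically
    # still resident in the page cache.
    msgs = []
    with open(log_path, "rb") as f:
        while header := f.read(4):
            msgs.append(f.read(int.from_bytes(header, "big")))
    return msgs

for i in range(3):
    append(f"event-{i}".encode())
```

Unlike MapReduce, this producer/consumer pattern has strong locality (consumers usually read what was just written), which is why the same OS cache that is overhead for Hadoop becomes an asset for Kafka.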
[1] http://www.cs.rice.edu/CS/Architecture/docs/phdthesis-shafer.pdf