
Hadoop:

Hadoop is a free, Java-based programming framework from the Apache Software Foundation that supports the processing of large data sets in a distributed computing environment. Hadoop makes it possible to run applications on systems with thousands of nodes handling thousands of terabytes of data. Its distributed file system facilitates rapid data transfer between nodes and allows the system to continue operating uninterrupted when a node fails. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.

The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related projects such as Apache Hive, HBase, and ZooKeeper. The framework is used by major players including Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

In short: Apache Hadoop is an open-source framework for developing distributed applications that can process very large amounts of data. It is a platform that provides both distributed storage and computational capabilities.

Hadoop has two main layers. Computation layer: the computation layer uses a framework called MapReduce. Distributed storage layer: a distributed filesystem called HDFS provides storage.

Why Hadoop? Building bigger and bigger servers is no longer necessarily the best solution to large-scale problems. Nowadays the popular approach is to tie many low-end machines together as a single functional distributed system. For example:

A high-end machine with four I/O channels, each with a throughput of 100 MB/sec, will require about three hours to read a 4 TB data set. With Hadoop, this same data set is divided into smaller blocks (typically 64 MB) that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS). With a modest degree of replication, the cluster machines can read the data set in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server.

For computationally intensive work: most distributed systems (e.g. SETI@home) take the approach of moving the data to the place where the computation will happen, and moving the resulting data back for storage afterwards. This approach works fine for computationally intensive work.

For data-intensive work: a better approach is needed, and Hadoop has a better philosophy here, because it focuses on moving the code/algorithm to the data instead of the data to the code/algorithm. The move-code-to-data philosophy applies within the Hadoop cluster itself: data is broken up and distributed across the cluster, and computation on a piece of data takes place on the same machine where that piece of data resides. This philosophy makes sense because code is almost always smaller than the data it processes, and is therefore easier to move around.
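As a rough sanity check of the numbers above (using the text's figures of four 100 MB/s channels, plus a hypothetical 100-node cluster for comparison):

```python
# Figures from the text: a single machine with four I/O channels at
# 100 MB/s each, reading a 4 TB data set.
DATA_SET_MB = 4 * 1000 * 1000        # 4 TB expressed in MB
SINGLE_MACHINE_MBPS = 4 * 100        # four channels at 100 MB/s

single_machine_hours = DATA_SET_MB / SINGLE_MACHINE_MBPS / 3600
print(f"single machine: {single_machine_hours:.1f} hours")   # ~2.8 hours

# Hypothetical comparison: the same data spread over 100 nodes, each
# reading its local blocks at 100 MB/s, so aggregate throughput scales
# with the cluster size.
NODES = 100
cluster_seconds = DATA_SET_MB / (NODES * 100)
print(f"100-node cluster: {cluster_seconds / 60:.1f} minutes")
```

The 100-node figure is illustrative only; the point is that read time drops roughly linearly with the number of machines holding the blocks.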

Hadoop Key Features

Distributed computing is a very vast field, but the following key features have made Hadoop distinctive and attractive:

1. Accessible. Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).

2. Robust. Because Hadoop is intended to run on commodity hardware, it is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

3. Scalable. Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

4. Simple. Hadoop allows users to quickly write efficient parallel code. Hadoop's accessibility and simplicity give it an edge over writing and running large distributed programs.

Advantages of Hadoop

The following are the major areas where Hadoop is particularly strong:

Hadoop is a platform that provides both distributed storage and computational capabilities.

Hadoop is extremely scalable; in fact, it was first conceived to fix a scalability issue in Nutch (the open-source crawler and search engine).

One of the major components of Hadoop is HDFS (the storage component), which is optimized for high throughput.

HDFS uses large block sizes, which makes it work best when manipulating large files (gigabytes to petabytes).

Scalability and availability are distinguishing features of HDFS, achieved through data replication and fault tolerance.

HDFS can replicate files a specified number of times (the default is 3 replicas), making it tolerant of software and hardware failures; moreover, it automatically re-replicates data blocks from nodes that have failed.
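The replication and re-replication behaviour can be sketched with a toy simulation (the node names, block IDs, and random placement below are illustrative; real HDFS placement is rack-aware):

```python
import random

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes
    (toy placement, ignoring HDFS's real rack-awareness policy)."""
    return {b: set(random.sample(nodes, replication)) for b in blocks}

def re_replicate(placement, failed_node, nodes, replication=3):
    """After a node failure, copy under-replicated blocks to healthy nodes."""
    healthy = [n for n in nodes if n != failed_node]
    for block, holders in placement.items():
        holders.discard(failed_node)
        while len(holders) < replication:
            holders.add(random.choice([n for n in healthy if n not in holders]))
    return placement

nodes = [f"node{i}" for i in range(6)]
placement = place_blocks(["blk_1", "blk_2", "blk_3"], nodes)
re_replicate(placement, "node0", nodes)
# Every block is back at 3 replicas, none on the failed node.
assert all(len(h) == 3 and "node0" not in h for h in placement.values())
```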

Hadoop uses the MapReduce framework, a batch-based, distributed computing framework that allows parallelized work over large amounts of data.

MapReduce lets developers focus on addressing business needs rather than on distributed-system complications.

To achieve parallel and faster execution of a job, MapReduce decomposes it into map and reduce tasks and schedules them for remote execution on the slave (data) nodes of the Hadoop cluster.
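The map/reduce decomposition can be sketched in a few lines, using word counting as the job (pure Python; the framework's shuffle step, which groups intermediate pairs by key, is simulated here):

```python
from collections import defaultdict

# Map phase: each mapper emits (word, 1) pairs for its input split.
def map_phase(split):
    return [(word, 1) for word in split.split()]

# Shuffle: group intermediate pairs by key, as the framework does
# between the map and reduce phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big", "data big cluster"]   # two input splits
pairs = [p for s in splits for p in map_phase(s)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, each map task runs on the node holding its input split and each reduce task runs on some node after the shuffle; this sketch only shows the data flow, not the scheduling.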

Disadvantages or Limitations of Hadoop

The following are the most commonly cited weaknesses of the Hadoop framework:

Hadoop uses HDFS and MapReduce, and both of their master processes are single points of failure: until the Hadoop 2.x release series, each used a single-master model, although there is active work going on for high-availability versions.

Security is also a major concern: Hadoop does offer a security model, but it is disabled by default because of its high complexity.

Hadoop does not offer storage-level or network-level encryption, which is a very big concern for government-sector application data.

HDFS is inefficient at handling small files, and it lacks transparent compression. HDFS is not designed to work well with random reads over small files, because it is optimized for sustained throughput.
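To see why small files hurt, a commonly cited rule of thumb is that each file and each block consumes on the order of 150 bytes of NameNode heap (treat the exact figure as an assumption; it varies by version and configuration):

```python
# Rule of thumb (approximate): each file entry and each block entry
# costs the NameNode roughly 150 bytes of heap.
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)   # one entry per file + per block
    return objects * BYTES_PER_OBJECT

# 10 million small files (one block each) versus the same data stored
# as 10 thousand large files of 1000 blocks each:
small = namenode_bytes(10_000_000, blocks_per_file=1)
large = namenode_bytes(10_000, blocks_per_file=1000)
print(f"small files: {small / 1e9:.1f} GB of heap")    # 3.0 GB
print(f"large files: {large / 1e9:.2f} GB of heap")    # 1.50 GB
```

The metadata cost is per object, not per byte, so millions of tiny files exhaust NameNode memory long before the data itself is large.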

MapReduce has a batch-based architecture, which means it does not lend itself to use cases that need real-time data access.

MapReduce is a shared-nothing architecture, so tasks that require global synchronization or sharing of mutable data are not a good fit, which can pose challenges for some algorithms.

Which version of Hadoop should you use?

There are a few active release series. The 1.x release series is a continuation of the 0.20 release series, and contains the most stable versions of Hadoop currently available.

This series includes secure Kerberos authentication, which prevents unauthorized access to Hadoop data. Almost all production clusters use these releases, or derived versions (such as commercial distributions).

The 0.22 and 0.23 release series are currently marked as alpha releases (as of early 2012), but this is likely to change by the time you read this as they get more real-world testing and become more stable.

0.23 includes several major new features: a new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource-management system for running distributed applications.

Current version of Hadoop and its release date: Release 2.2.0, available since 15 October 2013.

Thus Hadoop is surely a technology that has already eased big data processing and provided both computational and distributed functioning. It is the future of big data and the IT industry. Hadoop removes the need to install dedicated servers for storage, and HDFS makes it easier to store and access data over the network. The move-code-to-data feature of Hadoop is another feather in the cap of this technology.
