
SOLUTION SHEET

Syncsort DMX-h for Hadoop ETL
Unleash Hadoop's potential with a smarter
approach to integrating and processing Big Data

APACHE HADOOP is gaining traction as a general-purpose
framework for collecting, processing, and
storing Big Data, and ETL is emerging as the key use
case for those Hadoop implementations. Unfortunately,
as organizations ramp up their Hadoop initiatives, they
often face barriers that undermine its potential. Syncsort
offers a unique approach to Hadoop that lowers the
barriers to wider adoption and helps organizations
unleash the full potential of Hadoop, making it a more
robust environment for the enterprise. With Syncsort,
organizations can use existing ETL skills to
accelerate their Hadoop initiatives and process more
data in less time, with less money and fewer resources.

Big Data Is Breaking Existing Architectures


As organizations try to make sense of the ever-expanding
data avalanche, they are hitting the architectural limits
of their data processing environments. As a result,
organizations are increasingly adopting the Hadoop
MapReduce framework as a means to scale the collection
and processing of data while reducing costs. Unfortunately,
most traditional ETL tools only generate complex code to be
executed in Hadoop. Without any real integration, Hadoop
can turn into a heavy burden, forcing organizations to:
• Acquire expensive, hard-to-find MapReduce skills.
• Understand and manually maintain thousands of
  lines of code, even for simple ETL flows.
• Constantly add lots of hardware in order to scale
  and maintain service level agreements.

Syncsort delivers a smarter approach to Hadoop ETL,
enabling organizations to:
• Connect to virtually any data source, including
  mainframe and MPP databases.
• Move data into and out of Hadoop up to 6x faster
  without the need for manual scripts.
• Develop MapReduce ETL processes without
  writing code.
• Seamlessly accelerate Hadoop performance and
  scalability for sort and ETL operations.

Smart Connectivity for Faster Data Loads and Extracts
DMX-h writes data directly to HDFS using native Hadoop
interfaces. DMX-h can partition the data and parallelize
the loading processes to load multiple streams
simultaneously into HDFS, reducing loading times by up
to 6x, compared with the Hadoop put command.
Shortening the time it takes to get data into HDFS can
be critical for many companies, such as those that
must load billions of records each day. Reducing load
times can also be critical for organizations that plan to
increase the amount (and types) of data they will need to
load into Hadoop as their application or business grows.
In addition to providing fast and efficient loading, DMX-h
is commonly used to pre-process data prior to loading,
which avoids the complexity and inefficiency that can
result from loading raw source data into HDFS directly.
Integrating and structuring the data with DMX-h
before loading it into HDFS reduces load times, makes
downstream MapReduce processing tasks easier to
develop and faster and more efficient to execute, and
reduces storage requirements on the cluster.
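The partition-and-parallelize idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not DMX-h's implementation: the function names (`partition`, `write_stream`, `parallel_load`) are hypothetical, the round-robin split is an assumed partitioning scheme, and a local directory stands in for HDFS.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_streams):
    """Split records round-robin into num_streams lists (assumed scheme)."""
    parts = [[] for _ in range(num_streams)]
    for i, record in enumerate(records):
        parts[i % num_streams].append(record)
    return parts


def write_stream(target_dir, stream_id, records):
    """Write one partition to its own file; stands in for one load stream."""
    path = os.path.join(target_dir, f"part-{stream_id:05d}")
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(record + "\n")
    return path


def parallel_load(records, target_dir, num_streams=4):
    """Partition the records and write all partitions concurrently."""
    parts = partition(records, num_streams)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        futures = [
            pool.submit(write_stream, target_dir, i, part)
            for i, part in enumerate(parts)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    rows = [f"record-{n}" for n in range(10)]
    # A temporary local directory is used here in place of an HDFS path.
    with tempfile.TemporaryDirectory() as tmp:
        for p in parallel_load(rows, tmp, num_streams=3):
            print(os.path.basename(p))
```

A single sequential copy writes one stream at a time; here each partition is written by its own worker, which is the general reason a multi-stream load can finish sooner when the target can absorb concurrent writes.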


Same Familiar Tool. Five Core Transformations. All the Possibilities.

• One tool to connect Hadoop to all sources and
  targets, including mainframe sources
• Develop MapReduce ETL processes
  without writing code
• Leverage existing ETL skills
• Development accelerators for CDC
  and other common data flows
• Five smart transformations: Sort, Join, Aggregate, Copy, Merge
• Patented algorithms
• No code generation, no compiling
• Execute within MapReduce
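To make the five transformation types concrete, here is a minimal Python sketch of what sort, join, aggregate, copy, and merge each do to a small set of in-memory records. It is a generic illustration only: the function names and sample fields (`cust`, `amount`, `name`) are hypothetical, and nothing here reflects DMX-h's patented algorithms or its graphical interface.

```python
import heapq
from collections import defaultdict
from operator import itemgetter


def sort_records(records, key):
    """Sort: order records by one field."""
    return sorted(records, key=itemgetter(key))


def join_records(left, right, key):
    """Join: inner join of two record sets via a hash lookup on key."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(records, key, value):
    """Aggregate: sum one numeric field per distinct key."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r[value]
    return dict(totals)


def copy_records(records):
    """Copy: pass records through unchanged, as independent copies."""
    return [dict(r) for r in records]


def merge_sorted(a, b, key):
    """Merge: combine two inputs that are already sorted on key."""
    return list(heapq.merge(a, b, key=itemgetter(key)))


if __name__ == "__main__":
    orders = [
        {"cust": 2, "amount": 30},
        {"cust": 1, "amount": 10},
        {"cust": 2, "amount": 5},
    ]
    customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]
    # Combine two transformations into one small flow: join, then aggregate.
    joined = join_records(orders, customers, "cust")
    print(aggregate(joined, "name", "amount"))
```

Chaining the functions, as the last two lines do, mirrors the idea of combining a small set of transformations into a larger data flow.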

DMX-h can also extract data from Hadoop to other data
stores, leveraging Hadoop as a key processing step in a
workflow that includes other database technologies; for
example, using Hadoop to process data prior to loading it
into an analytic data warehouse or appliance.

Smart Development. No Coding. No Scripting.
DMX-h provides a simple, powerful, and seamless
environment for developing MapReduce tasks for Hadoop.
It makes the development and maintenance of ETL tasks for
MapReduce, and applications that move data into and out of
HDFS, faster, easier, and less error-prone.
The solution enables people with a much broader range
of skills, not just developers, to create ETL tasks that
execute within the MapReduce framework, replacing
complex Java programming or Pig scripting with a
powerful, easy-to-use graphical development environment.
It also simplifies the development of applications that load
data into HDFS, or that extract data from HDFS and load it
into other systems.
DMX-h makes it faster and easier to develop, maintain, and
re-use applications that execute on Hadoop via:
• Comprehensive built-in transformations.
• Native mainframe data access and conversion
  capabilities.
• Heterogeneous DBMS access on the cluster for loading
  Hadoop, loading warehouses from Hadoop without the
  need for a temporary landing area, and for sourcing
  data for lookups and other transformations.
• A graphical development environment.
• Built-in metadata capabilities, which enable greater
  transparency into impact analysis, data lineage, and
  execution flow.

[Figure caption: Combine and reuse transformations to create virtually
any data flow. Coding is optional, not required.]

Hadoop Integration... for Real


DMX-h can seamlessly replace the native sort within
MapReduce processing, providing performance benefits for
MapReduce tasks written in any language including Java
and Pig. This simple change can increase the performance
of sort steps in Hadoop by 2 to 3x with no programming
changes or tuning required for new or existing MapReduce
tasks. As a result, 2 to 3x more data can be processed in the
same amount of time on the same cluster.
DMX-h has a very small footprint, so it can be easily
deployed on every node, complementing Hadoop's
horizontal scalability by maximizing performance of each
node within the Hadoop cluster. Once deployed, DMX-h
automatically optimizes the resource utilization (e.g., CPU,
memory, and I/O) on each node to deliver the highest levels
of performance, scalability, and throughput, with no manual
tuning needed.
The superior runtime performance of DMX-h is a result
of thousands of deployments, leveraging hundreds
of production-proven optimizations, with important
innovations in four areas:
• A library of high-performance algorithms for all key
  set-related data transformations.
• Direct I/O access for the fastest data transfers.
• High-performance compression to minimize I/O
  and intermediate work file sizes.
• A dynamic ETL optimizer to ensure maximum
  performance at runtime, with minimum resource
  utilization.
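As a rough illustration of how compression can shrink intermediate work files, the following Python sketch performs a textbook external merge sort whose sorted runs are written as gzip-compressed files and then merged. It is an assumption-laden example, not DMX-h's algorithm: the function names, the choice of gzip, and the `chunk_size` parameter are all hypothetical.

```python
import gzip
import heapq
import os
import tempfile


def write_run(lines, directory, run_id):
    """Sort one chunk in memory and write it as a gzip-compressed run file."""
    path = os.path.join(directory, f"run-{run_id}.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in sorted(lines):
            out.write(line + "\n")
    return path


def read_run(path):
    """Stream the lines of one compressed run back in sorted order."""
    with gzip.open(path, "rt", encoding="utf-8") as run:
        for line in run:
            yield line.rstrip("\n")


def external_sort(lines, chunk_size=1000):
    """Sort lines via compressed intermediate runs and a final k-way merge.

    Assumes the input lines contain no embedded newlines.
    """
    with tempfile.TemporaryDirectory() as workdir:
        runs, chunk = [], []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                runs.append(write_run(chunk, workdir, len(runs)))
                chunk = []
        if chunk:
            runs.append(write_run(chunk, workdir, len(runs)))
        # Merge while the work directory still exists.
        return list(heapq.merge(*(read_run(p) for p in runs)))


if __name__ == "__main__":
    data = [f"{n:04d}" for n in (5, 3, 9, 1, 7, 2, 8)]
    print(external_sort(data, chunk_size=3))
```

Only one chunk is held in memory at a time during the first phase, and every byte written to or read from the work directory is compressed, which is the trade of CPU for reduced I/O that the list above alludes to.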

Benefits of DMX-h for Hadoop ETL


DMX-h delivers measurable strategic, financial, and
operational benefits to organizations across a wide range of
industries.
• Faster Time to Insight. Organizations make better
  decisions faster based on more accurate insights,
  by processing more data in less time, with the same
  resources.
• Lower OPEX and CAPEX Costs. Organizations reduce
  capital and operational expenses by eliminating the
  need for additional compute nodes on the cluster, due
  to more efficient hardware utilization.
• More Jobs, Same Cluster. DMX-h provides better
  performance and scalability for processing data in Hadoop,
  enabling organizations to process up to 3x more data in
  less time using the same, or fewer, resources. Better
  performance means jobs finish sooner, freeing the cluster
  to handle more jobs within the same processing windows,
  avoiding incremental capital expense.

Hadoop Integration for Real (No Code Generation. No Compiling. No Tuning.)

• Runs natively within MapReduce
• Small footprint, installs on every node
• Open source contributions extend
  capabilities of MapReduce
• Pluggable sort
• Expanded use cases (i.e., no-sort option)
• Vertical scalability
• Design flexibility (Map Map Reduce Reduce)

[Figure label: Hadoop Data Nodes]


• Cost-effective Scalability. DMX-h enables Hadoop
  clusters to scale more efficiently and cost-effectively
  because data processing and loading performance does
  not degrade as data volumes grow.
• Improved Developer Productivity. Organizations simplify
  developing, maintaining, and re-using ETL tasks on
  Hadoop. DMX-h makes it easier to integrate and load
  data from heterogeneous data sources, perform ETL
  tasks within the MapReduce framework, and extract data
  from Hadoop to other data stores.
• Reduced Dependence on Expensive, Specialized New
  Hires. Organizations minimize or eliminate the need to
  hire new, specialized staff with expensive programming
  skills (e.g., Java, Pig, Sqoop), lowering staffing and
  training costs, and speeding time to value. Existing
  staff can get more done with powerful, easy-to-use
  tools, which increase productivity and lower application
  development, maintenance, and re-use costs.
• Exceptional Performance SLAs. Organizations are
  able to more easily and cost-effectively meet or
  exceed performance service level agreements (SLAs),
  reducing the risk of incurring performance
  SLA penalties.
• Increased Transparency. Built-in metadata capabilities
  enable greater transparency into impact analysis, data
  lineage, and execution flow, which facilitates re-use,
  data governance, and regulatory compliance.
• Fast and Painless Installation and Configuration.
  Installation and configuration of DMX-h is fast and
  simple, minimizing the time it takes to go from a
  standing start to full productivity.
For more information, visit www.syncsort.com/hadoop.

Unleash Hadoop's Potential

[Figure: product highlights. Easy setup and administration in minutes;
smart, self-tuning engine; light footprint; single install; logging and
scheduling; the faster sort technology. Callouts of "2x Faster" for ETL
and for MapReduce jobs.]

[Chart: TeraSort Benchmark. Elapsed time (min) versus file size (GB),
comparing native sort with Syncsort.]

[Chart: ETL Aggregations. Elapsed time (min) versus file size (GB),
comparing Pig and Java with Syncsort.]

About Syncsort
Syncsort provides data-intensive organizations across the big data continuum with a
smarter way to collect and process the ever-expanding data avalanche. With thousands of
deployments across all major platforms, including mainframe, Syncsort helps customers
around the world to overcome the architectural limits of today's ETL and Hadoop
environments, empowering their organizations to drive better business outcomes in less
time, with fewer resources and lower TCO. For more information, visit www.syncsort.com.
© 2013 Syncsort Incorporated. All rights reserved. All company and product names used herein may be the trademarks
of their respective companies. DMX-SC-001-0213US

50 Tice Boulevard, Woodcliff Lake, NJ 07677
201.930.8200 | www.syncsort.com
