ETL Benchmark Favours DataStage and Talend

Vincent McBurney Dec 9, 2008 | Comments (7)
A French consulting company called Manapps has released an ETL benchmark report
that compares Talend, Pentaho Kettle, DataStage and Informatica, and two of those
four vendors will be pleased with the results.
*** March 2009 Update: I was contacted by ManApps, who informed me that the benchmark
report was a draft and not the final report. There is a final report that shows more
favourable results for Informatica and I will update this blog post when I have some
spare time. Here is the statement from ManApps as translated from French by Google
Translate:
We publish on our website version of the document "Benchmark ETL" whose original
version was wrongly found work published on various sites or blogs outside our
company. This version significantly modifies certain measures have been taken based
on more advanced technical parameters regarding Power Center Solution Informatica,
we did not have in the original version. The findings have been modified accordingly.
Note that this document remains a working document. It has no goal of marketing. We
want to provide publishers covered by the study may be required to take any action they
deem useful, and if necessary publish the results accordingly.
*** End of update
I am going to write a post on what I think of the objectives of this benchmark but for
now here are the results and an analysis of each test. Manapps is part of
OmegaHighTech, a company with over 3,500 employees around the world. Amongst
other things Manapps do Business Intelligence consulting and data warehouse
implementations.
The Benchmark is released under a Creative Commons license:
You are free:
to Share: to copy, distribute, display, and perform the work
to Remix: to make derivative works
I hear Coldplay have already taken the words to use in their next song.
I don't know how they distributed the PDF. I found it in a blog post by Marc Russell:
ETL Benchmark by Manapps. I've copied the graphs into this blog post with some
comments and cropped off the top of the graphs for readability and because I don't
think the really high scores are reflective of the products but show poor ETL design.
Event Number 1 - Sequential Files
The first test was reading a sequential file and writing out to a sequential file. Anyone
who knows DataStage can guess the result: DataStage Server Edition will be
awesome and DataStage PX not so great:
ETL Benchmark Favours DataStage and Talend http://it.toolbox.com/blogs/infosphere/etl-benchmark-favours-datastag...
1 of 12 21/01/2014 10:37
Benchmark 1 Results ETL Sequential File Processing
DataStage Server Edition LOVES sequential files. It's been optimised over 15 years
of releases, not just for reading and writing sequential data but for memory caching and row
buffering in the middle. Look at the 5 million row result: less than a third of the time
of the nearest competition. This is one of the reasons why DataStage Server Edition
customers who have a lot of low to mid range data sizes and share data in sequential
files are sticking with Server Edition.
DataStage PX tolerates sequential files: it imports the data and converts it to parallel
format and then exports it again back to sequential format. Not only that, but because
they chose 2 nodes DataStage PX had to partition and then unpartition the damn data.
Let me show you how much this sucks. Let's say you have a wheelbarrow with two
sacks of flour in it and you've got to deliver it to the king of Sparta before sunset or
he'll kick you into a bottomless pit. With most of these ETL tools you pick up the
wheelbarrow and run like hell to the king. With DataStage PX you pick up the
wheelbarrow, you wheel it over to two wheelbarrows and put a sack of flour in each
and then clone yourself and the two of you push both those wheelbarrows to the
palace where you swap them back to one wheelbarrow and give it to the king. He
kicks you and your doppelganger down the bottomless pit and throws the
wheelbarrows after you. If you have 100 flour bags your parallel wheelbarrows are
great, but with two flour bags it's a waste of time.
A single job parameter that tells this job to run in sequential mode could have made it
as much as 50% faster.
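To make the wheelbarrow story concrete, here is a minimal Python sketch. It is not DataStage code: the function names `copy_sequential` and `copy_partitioned` are made up for illustration, and it assumes a simple round-robin partitioner with an ordered collector, which may not be what the benchmark job used. Both routes deliver the same rows; the partitioned route just adds a deal-out step and a collect step that buy nothing when there is no per-row work to share.

```python
from itertools import zip_longest


def copy_sequential(rows):
    """One wheelbarrow: read each row and write it straight out."""
    return [row for row in rows]


def copy_partitioned(rows, nodes=2):
    """Several wheelbarrows: deal rows out round-robin, then collect them again."""
    # Partition step: row i goes to node i % nodes.
    partitions = [rows[i::nodes] for i in range(nodes)]
    # Each node would do its per-row work here; a plain file copy has none.
    # Collect step: interleave the partitions back into a single stream.
    missing = object()  # sentinel for partitions that ran out of rows
    collected = []
    for group in zip_longest(*partitions, fillvalue=missing):
        collected.extend(row for row in group if row is not missing)
    return collected


rows = [f"row-{i}" for i in range(5)]
same_output = copy_partitioned(rows) == copy_sequential(rows)
```

The extra list building in `copy_partitioned` is the flour being moved between wheelbarrows. It only pays off once each node has real work to do on its share of the rows.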
Informatica: youch! They are like Forrest Gump before the leg braces came off.
They finished the first race after DataStage Server finished the third race.
Informatica was the only ETL tool in this test that used three stages to do this job
instead of two. They did a file input and then a file delimiter definition: a painfully slow
row-by-row delimiter definition. I'm no expert but someone tell me this was a dumb
job design. Eventually Informatica picked up speed and in the 20 million test it came
second.
Test 2 - MySQL
This test only compared Talend and Pentaho writing to MySQL. In this test they had
two versions of Talend: TOS 2.4.1 and TOS 2.4.1 with extended insert. This tells me that
Manapps did a bit more research on Talend than Pentaho Kettle. By this stage I'm
thinking this test is rigged. Fun, but rigged.
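For anyone wondering why the extended insert variant gets its own column, here is a rough Python sketch of the difference. I am assuming Talend's option corresponds to MySQL's multi-row `INSERT ... VALUES (...), (...)` syntax; the helper names are hypothetical and the `%s` placeholders follow the style used by common MySQL drivers for Python. The code only builds the statements and does not talk to a database.

```python
def single_row_inserts(table, columns, rows):
    """One INSERT statement per row, so one statement execution per row."""
    cols = ", ".join(columns)
    marks = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({cols}) VALUES ({marks})"
    return [(sql, tuple(row)) for row in rows]


def extended_insert(table, columns, rows):
    """One multi-row INSERT: every row travels in a single statement."""
    cols = ", ".join(columns)
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([one_row] * len(rows))
    params = tuple(value for row in rows for value in row)
    return f"INSERT INTO {table} ({cols}) VALUES {values}", params


rows = [(1, "a"), (2, "b"), (3, "c")]
per_row = single_row_inserts("target", ["id", "name"], rows)
batched_sql, batched_params = extended_insert("target", ["id", "name"], rows)
```

Three rows cost three statements one way and a single statement the other, which is where the time saving would come from on a real connection.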
Test 3 - Read a Database
In test 3 Manapps has a test to read from an Oracle database table and write out to a
sequential file.

Test Results 2 ETL Reading a Database
DataStage PX did surprisingly well. It had three stages: database, modify and
sequential file. It repartitioned twice (needlessly) so could be improved, but because it
has the Oracle Enterprise stage it kicked butt on the database read. DataStage
Server did not do so well and I would have loved to tweak the array size and transaction
size properties to read the database data in one chunk. Might have got a huge
performance boost.
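The array size idea is easy to show outside DataStage. This is a minimal Python sketch using the standard DB-API `arraysize` attribute and `fetchmany()` against an in-memory SQLite table, which stands in for the Oracle source. The table name `src` and the helper `read_in_batches` are made up for illustration, and the DataStage array size property is a separate setting that this only mimics.

```python
import sqlite3


def read_in_batches(conn, query, array_size):
    """Return all rows plus the number of fetch calls that returned data."""
    cur = conn.cursor()
    cur.arraysize = array_size  # rows returned by each fetchmany() call
    cur.execute(query)
    rows, fetches = [], 0
    while True:
        batch = cur.fetchmany()
        if not batch:
            break
        fetches += 1
        rows.extend(batch)
    return rows, fetches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER)")
conn.executemany("INSERT INTO src VALUES (?)", [(i,) for i in range(1000)])

rows_small, fetches_small = read_in_batches(conn, "SELECT id FROM src", 1)
rows_big, fetches_big = read_in_batches(conn, "SELECT id FROM src", 1000)
```

SQLite runs in-process so the fetch count costs little here. Against a database over a network each fetch is typically a round trip, which is why a bigger array size can help.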
Informatica: well, I've chopped off the top of the graph to make it readable but again
they had the extra stage and that added 40 seconds to each job run time. Surely
there is a way to avoid that 40 second lag.
Test 4 - Database Bulk Load
This test was notable as the only one that Pentaho Kettle won. At least in the
minuscule volume category. The editor must have been asleep at the wheel when he
let this one through. They did worse as the volumes went up.

Test Result 4 Feeding Oracle Bulk Load
It was another test that Server Edition won comfortably because this is essentially the
same as test 1: it's all about the speed of creating a sequential file. After you've got
the file all five ETL tools call the exact same Oracle bulk loader. Test 1 and Test 4
are effectively identical.
Test 5 - Same as Test 1 with a Transform in it
Finally, finally! An ETL test that has Transform in it (you know, the T in ETL?) All the
tests up to now were bullshit, this is the first true ETL test and it took Manapps five
tests to get here:
Test Result 5 ETL Transformer
This was the first time the parallel partitioning of DataStage PX may have helped
rather than hindered. With two nodes doing those transform functions, the higher the
volumes the more it wins. We finally see Informatica do well, coming second in the 20
million row test. What I would give to see a 1,000,000,000 row test. Once again
Informatica had that initial 35 second handicap but you take that out and it performed
really well.
Test 6 - ELT
This was a silly test as Manapps didn't have DataStage ELT and didn't know how to
use Informatica ELT. They tried another way to force those products to use ELT but got
it wrong. The one thing I will say about this test is that Talend seems to have a good
GUI for simple ELT:
Benchmark Job Talend ELT
This job lets you define an Oracle connection, aggregate the data and save it to
another Oracle table. It ran in under 2 seconds for up to 1 million rows so I assume it
pushes it all down to the database. This is a good ETL way to write an aggregation:
it runs just like a database group by command but it's got an open and easy to read
data lineage. This is just the type of thing I would expect from DataStage and
Informatica ELT if you can get it running.
What Manapps did in this test that is kind of sneaky is have the source and target
table in the same database, which kind of defeats the purpose of using an ETL tool.
A true ELT scenario involves different source and target databases. The other
four ETL tools could have done the exact same thing with a user-defined SQL select,
though the data lineage would not have been as good.
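Here is roughly what the two approaches look like, as a minimal Python sketch using an in-memory SQLite database in place of Oracle. The table names (`sales`, `elt_target`, `etl_target`) and function names are hypothetical, and I am assuming the Talend job boils down to a single `INSERT ... SELECT ... GROUP BY` when source and target share a database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.execute("CREATE TABLE elt_target (region TEXT, total INTEGER)")
conn.execute("CREATE TABLE etl_target (region TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7)],
)


def aggregate_elt(conn):
    """ELT: one statement, the database does the work, no rows leave it."""
    conn.execute(
        "INSERT INTO elt_target "
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )


def aggregate_etl(conn):
    """ETL: pull every row out, aggregate in the tool, write the result back."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales").fetchall():
        totals[region] = totals.get(region, 0) + amount
    conn.executemany("INSERT INTO etl_target VALUES (?, ?)", list(totals.items()))


aggregate_elt(conn)
aggregate_etl(conn)

query = "SELECT region, total FROM {} ORDER BY region"
elt_rows = conn.execute(query.format("elt_target")).fetchall()
etl_rows = conn.execute(query.format("etl_target")).fetchall()
```

Both targets end up with the same totals. The ELT version never moves a row out of the database, which only works because source and target live in the same place.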
The tester made the comment:
Only Talend Open Studio permits to use an ELT mod. Informatica got the Push
Down Optimization, but I didn't find this feature on the tool.
You've got to buy the add-on! It's not free with the tool! They are not as charitable as
Talend.
Test 7 - more ELT
This is the most interesting test in the benchmark because it shows how ETL engines
process faster than ELT by throwing grunt, memory and hardware at the transform
part of the job. This test compares a pure ELT command from Talend versus traditional
ETL from DataStage and ETL wins:
Benchmark Job Talend complex ELT
Benchmark Job DataStage Join and Filter
The first diagram is the clever Talend ELT interface that leaves the data on the
database and performs some mapping on it. The second diagram is traditional ETL:
DataStage reads the data and then transforms, joins, transforms, filters and writes it
out. It looks like it's doing a lot more work but don't let those Modify stages deceive
you. They are almost zero overhead and since there are no sequential files this job is
in its element and comes out fastest:
Benchmark Result 7 Join and Filter
Talend has just one processing engine: the database. DataStage has two: the
database and the ETL server. The higher the volume the faster DataStage PX will
go. Manapps only tested up to 1,000,000 in this test despite testing higher volumes in
other tests. I would have liked a 20,000,000 test for this one.
It would be so so very much faster with a tiny bit of tuning. You see the little
DataStage sort symbol on the links? That's pain time. That's data being sorted and
repartitioned, swapping flour bags between wheelbarrows. If you sort the data in the
source database stages and remove the sorts from the job this baby runs a lot faster.
Because DataStage PX jobs push everything through a parallel engine you become
adept at sorts and partitions and Manapps would have worked this out with some
scenario testing.
Test 8 - Sort
A very interesting test result that shows how friggen fast the DataStage PX sort is.
When Applied Parallel Technologies (who became Torrent Systems and then IBM
DataStage PX) wrote a parallel flow based processing engine in 1993, one of the first
functions they wrote was sort. It was an obvious candidate for running faster in
parallel mode. Fifteen years later and it flies:
Benchmark Job DataStage PX Sort
It's a simple DataStage job design: read from one file and write to another. The
properties of the second file insist on the data being sorted so you can see the little
yellow sort symbol that tells you the data on that link is being sorted. This test had
two sequential files, the Achilles heel of parallel processing, but it was kind of like a
sprint relay with Father Christmas handing the baton to Usain Bolt who handed it to
Roseanne Barr. The sort in the middle made up the time.
With the result that DataStage PX was miles ahead:
Benchmark Result 8 Sort Speeds
My own benchmark tests showed DataStage PX was many times faster at sort and
aggregation than DataStage Server Edition even before you added any parallel
nodes. It's got very well written processing components. A 7 minute sort in Server
Edition took 12 seconds on one node in DataStage PX. This is the reason why
Co-Sort and Syncsort (the sequential file sorting specialists) are welcome at
DataStage Server sites and not DataStage PX sites. DataStage PX does not need
any help with sorts.
Talend used GNU sort external to the tool, and lost badly on the very high volume
sort. Maybe there is a better sort script out there. Looks like DataStage Server
Edition fell over on 20 million rows, not a huge surprise. If you are sorting big data
volumes you need to upgrade to PX! You'll get a huge discount at the moment thanks
to the credit crisis, they are desperate for any extra licensing.
Test 9 - ETL Aggregation
Test 9 is similar to test 8 but it's aggregation instead of sort. One of the few tests
Informatica won, edging out DataStage in the 20 million category despite that initial
35-40 second flow start:

Benchmark Result 8 ETL Aggregation
Run Informatica Run! The trend line for Informatica is impressive: not much increase
in time from 100,000 to 5,000,000. If they could break free of those leg braces earlier
they would be winning all categories.
The test developer made a mistake with the DataStage PX job in this test and left it
with two sorts instead of one:
Benchmark Job DataStage PX Aggregation
They used the job from Test 8 that had an enforced sort in it instead of creating a new
job or using the job from test 1. The aggregator will add a sort because it needs sorted data
in order to aggregate. The output sequential file is also asking for a sort (left over
from test 8), possibly in a different order to the aggregator, so this job is combining
test 8 and 9 into one and DataStage PX is still coming first or second in most results.
Could have been 10-20% faster without that second sort.
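If you have not seen why a sort-based aggregator insists on sorted input, this small Python sketch shows the idea. It uses `itertools.groupby`, which, like a sort-based aggregator, only groups rows that arrive next to each other. The function name `aggregate_sorted` and the sample rows are made up, and this illustrates the general technique rather than the internals of the DataStage stage.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Sum amounts per key in one pass; only correct if rows arrive sorted by key."""
    return [
        (key, sum(amount for _, amount in group))
        for key, group in groupby(rows, key=itemgetter(0))
    ]


rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]

# Unsorted input: each key shows up more than once in the output.
unsorted_result = aggregate_sorted(rows)

# Sort on the grouping key first and every key comes out exactly once.
sorted_result = aggregate_sorted(sorted(rows, key=itemgetter(0)))
```

The one sort on the grouping key is the price of admission. A second sort on the way out, in a different order, is pure overhead unless something downstream needs it.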
Test 10 - Lookups
Sigh, this is where the benchmark really gets loopy. Mork and Mindy loopy. This job
is what you expect someone to build if they have only been using DataStage for a
couple of hours:
Benchmark Job DataStage PX Join
It's a mess. These sorts and partitions are killers. Lots of flour bag sorting
and swapping between wheelbarrows. This job would be as much as 80% faster if you
replaced that join with a lookup. The lookup stage does not need any sorting to work
and 9 out of 10 times it will be faster than a join. By default I use a Lookup stage and
I need something to go seriously wrong with the job before I switch to a Join. This job
design doesn't cut it for benchmarking.
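The difference between the two stages is easier to see in code. This is a minimal Python sketch, not DataStage internals: `hash_lookup` assumes the reference data fits in memory and has unique keys, and `sort_merge_join` assumes the same unique reference keys. Both function names and the sample data are hypothetical.

```python
def hash_lookup(stream, reference):
    """Lookup: load the reference data into memory once, probe it per row. No sorts."""
    ref = dict(reference)
    return [(key, value, ref[key]) for key, value in stream if key in ref]


def sort_merge_join(stream, reference):
    """Join: sort both inputs on the key, then walk them in step."""
    left = sorted(stream, key=lambda row: row[0])
    right = sorted(reference, key=lambda row: row[0])
    out, j = [], 0
    for key, value in left:
        # Advance the reference side until it catches up with the current key.
        while j < len(right) and right[j][0] < key:
            j += 1
        if j < len(right) and right[j][0] == key:
            out.append((key, value, right[j][1]))
    return out


stream = [(3, "c"), (1, "a"), (2, "b"), (1, "d")]
reference = [(1, "one"), (3, "three")]

looked_up = hash_lookup(stream, reference)
joined = sort_merge_join(stream, reference)
```

Both find the same matches. The lookup does it without sorting anything, while the join pays for two sorts up front, which only makes sense once the reference data is too big to hold in memory.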
Talend does best with a small lookup volume, DataStage PX does okay and
Informatica is astoundingly bad.
Benchmark Result 10 ETL lookup
I'm no Informatica expert but the job design looked kind of crazy:

Benchmark Job Informatica Lookup
Can someone tell me what is wrong with it? Informatica lookups shouldn't be this
slow.
The one time you do want to use a DataStage PX Join stage instead of a Lookup is
when you have massive amounts of lookup data, and in this benchmark there was a
set of tests with 5,000,000 rows of lookup data and we finally got to see a Join stage
that was worthwhile:
Benchmark Result 10 high volume ETL Lookup
This test has high volume input rows AND high volume lookup rows. We have
reached a volume of data that justifies a Join stage where data is sorted before the
comparison of rows is performed and you can see the scalability of DataStage PX
on 20 million input rows joined to 5 million lookup rows. This test gives you an idea of
what would happen with a job with many stages: join, lookup, transform and sort.
DataStage PX would be further in front as the volumes go up and if you added more
CPUs the difference would be even more obvious.
Test 11 - Lookups with Rejects
Test 11 is similar to test 10 except when you cannot join you produce a reject. This
has me so frustrated, I want to take Manapps out and beat them with a rake. This test
would have been so much better with a DataStage PX lookup stage. It can do the
lookup and reject in one step, so much faster than the join stage that does it in two
steps with extra sorts.
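A lookup with a reject link is a one-pass job, as this minimal Python sketch suggests. The function name `lookup_with_rejects` and the sample rows are made up for illustration, and it assumes the reference data has unique keys and fits in memory.

```python
def lookup_with_rejects(stream, reference):
    """Probe the reference once per row; send misses to a reject list in the same pass."""
    ref = dict(reference)
    matched, rejects = [], []
    for key, value in stream:
        if key in ref:
            matched.append((key, value, ref[key]))
        else:
            rejects.append((key, value))
    return matched, rejects


stream = [(1, "a"), (2, "b"), (3, "c")]
reference = [(1, "one"), (3, "three")]
matched, rejects = lookup_with_rejects(stream, reference)
```

Every input row lands in exactly one of the two outputs, with no sorting and no second pass over the data.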
By this stage of the benchmark the Informatica job is looking like the route home that
a drunk driver takes to avoid the police:
Benchmark Job Informatica Drink Driving
What the hell? DataStage PX in the hands of a drunk driver still manages to crash
into second place on high volumes but I'm afraid the testers did not know enough
about lookups to do it justice. Informatica fared much worse in the hands of a novice
and I wait with bated breath to hear what was wrong with these job designs.
Conclusion
Thanks to Manapps for the benchmark but I would like to see the sequential file tests
run with DataStage on one node and the lookup tests with a lookup stage. Hey, isn't
that a coincidence. Lookup test, lookup stage. Who would have thought a lookup
stage would work for a lookup test?
Talend does a lot of its work in memory (like DataStage PX) but this starts to come
apart at the seams when the volumes go up. DataStage PX handles this by caching
and buffering. It would be interesting to see a benchmark going into the 100s of
millions of rows or 50 columns or more to see what each tool does under real stress.
The type of processing that is common for telcos, banks and insurers.
These tests do show that when you are down in the smaller volumes the open source
ETL tools are an option and I would prefer them to manual coding, but in the higher
volumes give me a premium tool any day. Even a novice can get good results.
7 Comments
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my
employer's view in any way.
Vincent McBurney is an IBM Information Champion for Information Integration.
Werner Daehn Dec 9, 2008
Would love to run that benchmark myself. In case you ever get the source files and
database tables let me know.
Personally I don't like the test either. How many GB of data is moved via flat files vs.
from source to target database? I guess the majority is database-to-database, hence
the file tests are nice and simple but do not help much as the parsing of the files can
be overly expensive, more expensive than the transformations - if there would be
any.
The other thing I am surprised by is that there is a difference between the vendors. I
would have thought that for these copy operations with a lookup in the middle and
the such, the performance bottleneck would be the disk I/O. So I would immediately
have guessed that the flows are not correctly designed with each tool. Especially in
an ELT case, where the engine has almost nothing to do compared to the database,
the difference should be zero, shouldn't it?
But the most surprising statement was actually yours about the Oracle bulkload:
"it's all about the speed of creating a sequential file. After you've got the file all five
ETL tools call the exact same Oracle bulk loader."
I know Informatica supports the Oracle API bulkloader, so no need to write any file.
Doesn't DataStage as well?
-Werner
Johannes Almiala Dec 10, 2008
I'm probably going to comment more later, but now a quick one for Test 11.
One thing I would have done differently with Informatica is that I would have used a
single Router transformation instead of four filters. A router does in one pass the
same as the four filters do in four passes, plus you catch the rows that don't match any
of the filter conditions. Also, there is no visibility on how the lookup has been
configured, it could easily be a bottleneck.
Generally, the default amount of memory allocated to transformation caches in
Informatica PowerCenter sessions is 5% (or 512 MB, whichever is smaller) of the
maximum available. If that hasn't been changed and the lookup source file is large,
this test will basically measure random disk reading speed on the server platform.
Vincent McBurney Dec 10, 2008
My Oracle bulk loader days go back to DataStage Server Edition and about Oracle 8!
That version wrote out a text dat file under the covers in the Oracle bulk loader data
format and passed the file to the Oracle bulk load program. DataStage PX has a
much newer Oracle Enterprise stage compatible with the newer versions of Oracle
but I don't know what it does under the covers. The bulk load test would be
interesting if the source was a database table so you could take sequential file
parsing out of the equation - and then bump the data volume up to 20 million rows.
Dec 14, 2008
It is interesting that the version of DataStage used in the benchmarking is two major
releases behind the current version 8.1. Along with little mention that DataStage is
the hands down winner in linear scaling of parallel jobs to available hardware by
simple changes to a configuration file. The fact that DataStage can scale seamlessly
beyond any other vendor in this test and that management of that scalability is least
costly in terms of hardware, installation, and IT resources is overlooked.
As mentioned it doesn't make a lot of sense to run any job in a parallel process when
the data volumes and transformative actions are minimized but once the volumes
increase or transformations expand beyond simple data mapping, the parallel engine
underlying the Information Server platform begins to easily outperform the other
vendors in the test.
In addition, no mention is made of the integrated platform Information Server brings
to the table as most of the vendors in the test recognize the data integration is much
more than ETL. Granted my opinions are biased and all should evaluate these
results from their own perspective. The only point here is that taking a simple
scenario or two does not give the reader an accurate view of the products or
capabilities as each vendor can demonstrate where the benchmarked test deviates
from their best practices for each product.
USER_1963953 Apr 1, 2010
this benchmark has strange results; just did a mapping with informatica powercenter
8.6.1 that calculates 2 ranks on 34Millions of row, joins them with over 64Millions of
rows, then aggregates them and takes 130 secs consuming on average 5 power 5+
CPUs at 1900 mhz.
repeated the same with a larger volume of data 80M+ for the 2 ranks and 1 billion for
the outer join and the aggregation, and it takes 450 secs to execs, same cpu
consumption.
anyway to benchmark ETL is very difficult task because the results are too much
related to the skill of the developer and the knowledge of architecture of the product
btw: informatica lookups are slower than the joiners at least in 8.6.1 release, we will
see in informatica 9
Younes Siebel Oct 7, 2010
I think that Talend, when there is not huge informations to deal with, can simply be
the best.
But the things that make it more interesting is that it cost 0.00$, while DataStage
Server is more than 80.000,00$!
naresh ketepalli Aug 2, 2011
Can anyone tell me the architecture and features of Talend.
Archive Category: Information Integration
Keyword Tags: etl benchmark manapps datastage informatica pentaho talend