A formula to estimate Hadoop storage and related number of data nodes...
Hadoop Users group | LinkedIn

Slim Baltagi
Sr. Big Data Architect at TransUnion
Hi
I would like to share with you a formula to estimate Hadoop storage and related
number of data nodes and get your thoughts about it.
1. This is a formula to estimate Hadoop storage (H):


H = c*r*S/(1-i)


where:
c = average compression ratio. It depends on the type of compression used (Snappy,
LZOP, ...) and size of the data. When no compression is used, c=1.
r = replication factor. It is usually 3 in a production cluster.
S = size of data to be moved to Hadoop. This could be a combination of historical
data and incremental data. The incremental data can be daily for example and
projected over a period of time (3 years for example).
i = intermediate factor. It is usually 1/4 or 1/3. It represents the share of Hadoop's working space dedicated to storing intermediate results of Map phases.


Example: With no compression (i.e. c=1), a replication factor of 3, and an intermediate factor of 1/4 (0.25):
H = 1*3*S/(1-1/4) = 3*S/(3/4) = 4*S
With the assumptions above, the Hadoop storage is estimated to be 4 times the initial data size.
2. This is the formula to estimate the number of data nodes (n):
n = H/d = c*r*S/(1-i)*d
where d = disk space available per node. All other parameters remain the same as in 1.
Example: If 8 TB is the available disk space per node (10 disks of 1 TB each, with 2 disks excluded for the operating system, etc.), and assuming an initial data size of 600 TB: n = 600/8 = 75 data nodes needed.
What are your thoughts?
Thanks.
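
A minimal sketch of the two estimates in Python, reading the node formula as n = H/d (i.e. division by d, which is also what the comments below converge on); the function names are illustrative only:

    import math

    def hadoop_storage(S, c=1.0, r=3, i=0.25):
        """Estimated raw Hadoop storage H = c * r * S / (1 - i).

        S: size of data to be moved to Hadoop (e.g. in TB)
        c: average compression ratio (1 = no compression)
        r: replication factor (usually 3 in production)
        i: intermediate factor (Map working space), usually 1/4 or 1/3
        """
        return c * r * S / (1 - i)

    def data_nodes(S, d, c=1.0, r=3, i=0.25):
        """Estimated number of data nodes n = H / d, with d = usable disk per node."""
        return math.ceil(hadoop_storage(S, c, r, i) / d)

    # Numbers from the example above: 600 TB of initial data, 8 TB usable per node.
    print(hadoop_storage(600))   # 2400.0 TB
    print(data_nodes(600, 8))    # 300 (the example's 75 = 600/8 divides S rather than H by d)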

Like (23) · Comment (28) · May 3, 2013
Comments


Roydon Pereira, Chandra Sekhar Reddy and 21 others like this

28 comments

Madhan Sundararajan Devaki
Assistant Consultant at Tata Consultancy Services

What is the intermediate factor, and how did you arrive at 0.25?
When compression is not enabled and the replication factor is 3, the required storage will be 3 times the size of the original data!
Like · May 5, 2013
Slim Baltagi
Sr. Big Data Architect at TransUnion

i = intermediate factor. It is usually 1/3 or 1/4. It is Hadoop's working space dedicated to storing intermediate results of Map phases. With no compression (i.e. c=1), a replication factor of 3, and an intermediate factor of 1/4 (0.25), the Hadoop storage is estimated to be 4 times the size of the initial data, not 3 times as you are mentioning.
Like (1) · May 6, 2013

bhaskara reddy. C. likes this

Shahab Yunus
J2EE Software Consultant at iSpace Inc.

"i = intermediate factor. It is usually 1/3 or 1/4."
How have you reached this number? Depending on the nature of the jobs, algorithms, or M/R patterns (e.g. the number of mappers), don't you think it can vary? Thanks.
Like · May 7, 2013
Chandra Sekhar Reddy
Software Developer at Teradata

It's really a good thought to come up with a formula to calculate the number of nodes by considering these factors.
I think we also have to consider the number of processors per node and the number of tasks per node as factors.
Like · May 7, 2013

Brian Macdonald
Enterprise Architect Specializing in Analytics using Big Data, Data Warehousing and Business Intelligence Technologies

25% for intermediate space is not unrealistic. This is a common guideline for many production applications. Even Cloudera has recommended 25% for intermediate results (http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/). Surely, they know a thing or two about production Hadoop systems.
The formula still holds true even if you want to plan for less intermediate space and use a smaller value for i.
I think this value is good for estimating the available HDFS storage for the cluster. What I remind customers to consider is that you will need to plan for output files as well, so the value of S should include the data being ingested as well as the output of any jobs.
Like (4) · May 9, 2013
SAURABH K., Slim B. and 2 others like this

Ted Dunning
Chief Application Architect at MapR

25% for intermediates is actually unrealistically large for very large datasets, if only because it would take a very long time to write so much intermediate data.
It is still good to have that much spare space, because almost all file systems behave much better with a fair bit of slack in their usage. Aiming at 75% fill is a good target for maintaining very good performance. I have seen some users run consistently at mid-90s fill with good results on MapR, but I definitely don't recommend it.
Another consideration is how well you want your system to run under various failure scenarios. For small clusters, even a single node failure can be very significant (75% fill on a 5-node cluster results in 94% fill after a single node failure ... is this system still going to work?). For larger clusters, you may want to think in terms of single rack or switch failures.
Like (3) · May 9, 2013
Anton M., Shahab Y. and 1 other like this
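
The 94% figure follows from the failed node's replicas being spread over the remaining nodes, which scales the fill fraction by nodes/(nodes - failed). A tiny sketch (the helper is hypothetical, not from the thread):

    def fill_after_failure(fill, nodes, failed=1):
        """Approximate cluster fill after `failed` nodes are lost and their
        replicas are rebuilt across the remaining nodes."""
        return fill * nodes / (nodes - failed)

    print(fill_after_failure(0.75, 5))   # 0.9375, i.e. the ~94% mentioned above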

Michael Ware
TDB

Estimating compression ratios can be a very complex issue, depending on the types of data you're working with. Some data can be compressed quite a bit more than other data. If the data you're storing in Hadoop isn't monolithic, then this needs to be compensated for.
Like · May 22, 2013
Jose Morales
Business Development | BigData | Storage Consultant | Cloud Computing | Europe | Looking for New Challenges

Getting the compression data is very complex, as mentioned by Michael Ware, and I also have to agree that 25% for intermediate data is unreasonably large for any dataset. However, if you feel that your cluster will be reaching anywhere near these numbers at deployment, then it is a good idea to plan for 75% allocated capacity, leaving some headroom to expand the cluster accordingly.
How about meeting a particular ingest bandwidth requirement? I believe that should also be part of the equation.
Like · June 9, 2013
Anil Kumar
The Big Data Professional

Thanks for the information, dear Mr. Slim.
Like · 7 months ago


RAJDEEP MAZUMDER
Technical Architect at Verizon Communications

Thanks, Slim, for the nice formulation!
For heterogeneous hardware configurations across the nodes of a bigger cluster, does this flat formulation still hold, or is there any difference? Though I can understand that here we are trying to estimate the number of nodes based on Hadoop storage, which mainly focuses on disk size.
Thanks!
Like · 7 months ago
Peter Jamack
Global Big Data PSOM at Teradata

I'd like to know more about your formula and examples.
You have the formula for nodes as n = H/d = c*r*S/(1-i)*d, but then your example is simply n = H/d and 75 data nodes. What is the rest of the formula for, then, in that piece?
Like · 3 months ago
Peter Jamack
Global Big Data PSOM at Teradata

I'm still confused about the number of data nodes needed according to your formulas.
The initial formula comes up with 2400, i.e. 4X the initial size of the data. The initial size was 600 TB. So if we then go and use 75 data nodes with only 8 TB each, that works out to 600 TB worth of data stored. Shouldn't that account for the fact that the original formula needed 4X that amount?
That's where I'm confused. You have 75 data nodes (8 TB each) that can store 600 TB worth of data, but according to your original formula, you'd need 4X that amount of space.
Like · 3 months ago
Peter Jamack
Global Big Data PSOM at Teradata

That's where I'm confused.
If you look at n = H/d = c*r*S/(1-i)*d, shouldn't the second part be /d instead of *d?
And shouldn't it be 300 nodes instead of 75?
Like · 3 months ago

Ted Dunning
Chief Application Architect at MapR

You also have to allow headroom in the file system underlying the DFS. For HDFS, this is usually ext3 or ext4, which gets very, very unhappy much above 80% fill. For MapR FS, you can go up to 90% and still get good performance.
You also need to allow headroom for failure modes. For example, with 5 nodes, if you lose one, the other 4 will get an extra 20% of the data as the replicas are re-built. This should not leave you above the max fill rate mentioned in the first paragraph if you want to keep running efficiently.
Like · 3 months ago
Eswar Reddy
DWH and Big Data Developer at IBM

Hello Mr. Slim, a very nice explanation, with a formula for how to calculate the Hadoop storage and the number of data nodes.
It's very helpful. Thank you.
Like · 3 months ago
Bhushan Lakhe
Senior Vice President at Ipsos

I think this is a good starting point! The 'intermediate factor', which seems to be a debatable variable, will of course vary with cluster activity (jobs etc.), and any job output that adds to the disk space should be accommodated as part of 'data growth', along with the normal expected increase in data volume.
Like (1) · 3 months ago
Fernando Henrique T. likes this

Vijaya Tadepalli
Engineering Manager at Akamai Technologies

Great discussion here.
@Ted: How much does RAM contribute to processing speed? If we go with 64 GB RAM vs. 128 GB RAM, can we expect more lines to be processed per minute (while dealing with log files, for example)?
Like · 2 months ago
Ted Dunning
Chief Application Architect at MapR

RAM contributes to processing speed in two ways for batch-oriented Hadoop programs.
The first way is to support mappers and reducers. You need to have enough space for your own processes to run, as well as buffer space for transferring data through the shuffle step. Small memory means that you can't run as many mappers in parallel as your CPU hardware will support, which will slow down your processing. The number of reducers is often more limited by how much random I/O the reducers cause on the source nodes than by memory, but some reducers are very memory hungry.
The second way that RAM contributes to processing speed is as file space buffers. With traditional map-reduce programs processing flat files (this includes Hive and Pig), you don't need to worry much about memory buffering, and 20% of total memory should be fine (this is the default on MapR, for instance). If you are running map-reduce against table-oriented code using, say, MapR's M7 offering, you should have considerably more memory for table caching. The default on MapR is 35% of memory allocated to file buffer space if tables are in use, but specialized uses I have seen have gone as high as 80% of memory for buffering.
64 GB should be enough for moderate-sized dual-socket motherboards, but there are definitely applications where 128 GB would improve speed. Moving above 128 GB is unlikely to improve speed for most motherboards.
Make sure you have enough spindles to feed the beast. The standard MapR recommendation lately is 24 spindles for the data, with an internal SSD or small drive for the OS. Some customers with specialized needs have pushed over 100 drives on a single node. You should do a careful design check before going extreme (above 36 drives) to make sure that you aren't compromising reliability. Up to 24 should be fine in most cases as long as you have enough nodes. Keep in mind that different distributions have different abilities to drive large drive counts. Only MapR can push 2+ GB/s, for instance, and many distributions have trouble with very large block counts on a single node.
Also look at networking speed. Network speed is moderately important during normal operations, but for the log processing that you suggest is your primary application, you can probably get by with lower bandwidth for normal operations. Where network speed becomes critical is when you are recovering from a node failure and need to move large amounts of data from node to node without killing normal operations.
The ability to control file locations can also mitigate network bandwidth requirements. With MapR, for instance, you can force all replicas of certain files to be collocated. This allows some programs to be optimized into merge-joins, which can sometimes eliminate several map-reduce steps and can essentially eliminate network traffic for those parts of the jobs. In one large installation with 1 Gbps networking, this resulted in 10x throughput improvements for that part of the data flow. This idiom is common in some forms of log processing.
Like · 2 months ago
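
For a concrete sense of the buffer figures quoted above (the 20% and 35% fractions are Ted's MapR defaults; the node sizes are just the ones discussed in this thread):

    def file_buffer_gb(total_ram_gb, fraction=0.20):
        """Memory set aside for file/table buffering at a given fraction of total RAM."""
        return total_ram_gb * fraction

    for ram in (64, 128):
        # flat-file default (20%) vs. tables in use (35%)
        print(ram, file_buffer_gb(ram), file_buffer_gb(ram, 0.35))
    # 64 GB  -> 12.8 GB and 22.4 GB
    # 128 GB -> 25.6 GB and 44.8 GB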
Vijaya Tadepalli
Engineering Manager at Akamai Technologies

@Ted: That's a lot of valuable information. Thanks much. I am actually at a very preliminary stage of planning for 20 TB of data, and the ask is that the 20 TB should be processed in less than 1 hour. "Processing" here refers to a couple of lookups on a few fields and adding/removing additional fields. I have this setup in mind, with an hourly dump of 20 TB:
5 machines with 16 TB of disk space and 64 GB RAM each, and a replication factor of 3. The Hadoop job's output is sent to Elasticsearch for indexing. Once the Hadoop job finishes, the data gets deleted, and Elasticsearch holds the data from there on. Thanks again.
Like · 2 months ago
Guillermo Villamayor
New Project in Computer Vision

@Slim Baltagi: what is the difference between H and S? In the example, H = 4*S.
If your r = 3, do you need r*H = 3*H disk space, or am I wrong?
Like · 2 months ago
Vijaya Tadepalli
Engineering Manager at Akamai Technologies

Any inputs on my proposed cluster setup for dealing with 20 TB hourly? Thanks.
Like · 2 months ago


Ted Dunning
Chief Application Architect at MapR

@Vijaya,
My advice is always to build a prototype before committing to SLAs.
When you say "lookups", what do you mean? Is the lookup against dynamic or static data? How much data is the lookup going against?
Also, note that you are running pretty close to the wind here on total speed. If you assume that you can read 1 GB/s/node (which is hard to maintain if you are doing *any* significant processing, or if you have to write that much as well), then each node can handle 3600 s/hr x 1 GB/s = 3.6 TB/hour. The cluster of 5 can handle 18 TB/hour. So simply ingesting this much data per hour is going to be doubtful, even without processing it.
How did you imagine that this would happen?
What kind of networking are you assuming?
Like (3) · 2 months ago
Amit M., Guillermo V. and 1 other like this
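
A back-of-the-envelope version of that throughput check (the 1 GB/s per-node read rate is Ted's assumption, not a measured figure):

    def hourly_read_capacity_tb(nodes, gb_per_s_per_node=1.0):
        """Best-case volume (TB) a cluster can read in one hour at the assumed per-node rate."""
        return nodes * gb_per_s_per_node * 3600 / 1000

    print(hourly_read_capacity_tb(5))   # 18.0 TB/hour, short of the 20 TB/hour target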


Vijaya Tadepalli
Engineering Manager at Akamai Technologies

@Ted:
Yes, we will definitely prototype and then commit to any SLAs.
At this moment, I just wanted to do a rough estimate of the hardware required, especially in terms of RAM, as the ask is really about finishing processing and pushing the data into Elasticsearch within 1 hour. I understand a 5-node cluster would be running it really tight for 20 TB, but if I add a couple of nodes with the same 16 TB disk and 64 GB RAM, or even 128 GB RAM, might that do the job?
The lookups are against static data of about 1 GB. Also, the processing mostly involves massaging each line; at least at this point we are not looking at aggregations or counts. For networking, if my setup can benefit from 10 Gb (it looks like it can), I would go with that option. Hope all this is making sense.
I haven't worked with Spark, but that is something I am considering exploring.
Thanks.
Like · 2 months ago
Ted Dunning
Chief Application Architect at MapR

I think that there is a good chance that more RAM would help by letting you use more cores. Fast networking will definitely help.
The key questions that I have are:
a) Can the source deliver this much data smoothly? If not, you will need more nodes simply because you will need some peak bandwidth on the networking side. Systems like Flume are fairly notoriously difficult to run at very high data rates.
b) Do you have to persist this data during processing? Most forms of HA will require that. Many require that you have multiple persistent copies if you want to avoid data loss scenarios.
c) Do you have enough spindles to meet your persistence requirements? 16 TB of disk could be 16 x 1 TB (better) or 4 x 4 TB (very limiting, since the max I/O rate will only be about 400-600 MB/s at best due to disk limits).
d) Can the destination accept data smoothly (or even at all)? My experience with high data rates into Elasticsearch indicates that a sophisticated team can ingest just north of 2 million log events per second using about 30 nodes of ES. You sound like you are in the same territory or even higher, but you don't account for this level of hardware.
Like (2) · 2 months ago
Tej L., Ganesh T. like this
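
A rough sketch of the spindle-count point in (c), assuming roughly 100-150 MB/s of sequential bandwidth per spinning disk (my assumption, consistent with the 400-600 MB/s figure above):

    def aggregate_disk_bw_mb_s(n_disks, mb_per_s_per_disk=125):
        """Rough aggregate sequential bandwidth for n_disks spindles."""
        return n_disks * mb_per_s_per_disk

    print(aggregate_disk_bw_mb_s(4))    # ~500 MB/s for 4 x 4 TB drives
    print(aggregate_disk_bw_mb_s(16))   # ~2000 MB/s for 16 x 1 TB drives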

Vijaya Tadepalli
Engineering Manager at Akamai Technologies

@Ted:
#a: The expectation is that processing is done within 1 hour after all the data is available in the Hadoop cluster. I still have to work on establishing a robust pipeline into Hadoop. Maybe Scribe, Flume, or something like Logstash to apply basic regexes and push into HDFS; not finalized yet.
#b & #d: No need to persist on the Hadoop cluster, but whatever gets pushed to Elasticsearch does need to be persisted. I need to plan the Elasticsearch cluster depending on how much data needs to be available. For example, if 12 hours of data (250 TB) is needed at any given point in time, then with a replication factor of 3, I need to look at how much space is needed for the index built by Elasticsearch and then take a call on the total disk space required. I am sure RAM would come into the picture in the case of Elasticsearch too. All this is still in the works.
#c: Will definitely look into getting more spindles to get better I/O performance.
Again, thank you for the valuable inputs and suggestions so far. I will add more details once I can finalize the collection and Elasticsearch aspects. Others facing similar asks might find it useful.
Like · 2 months ago


Raghavan Solium
Lead Architect - Big Data at OSSCube

I really appreciate the attempt to formulate the number of nodes. Overall, as a formula, the calculation looks good from the data point of view, though the intermediate factor is slightly on the high side. But it totally ignores the processing requirement. You need to keep in mind that the data nodes are also going to be processing nodes, hence you may need to increase the number of nodes beyond this calculation based on your job loads and performance requirements. A formula which captures all these aspects would be great. A slight nudging of your formula to include the processing aspects would be good.
Like · 2 months ago


Ted Dunning
Chief Application Architect at MapR

Raghavan,
You are correct that it is important to consider processing space as well, especially for nodes with relatively small amounts of storage.
Processing space should not, however, be computed as a fraction of total space, but in terms of how long the longest job runs. Each node can only write data at a bounded rate, and thus the length of the job determines how much space you need.
If you assume, for instance, that the most extreme job writes for, say, 20 minutes (this is not processing time, just the more serious write), then a single node probably doesn't need more than about a TB of processing space (1000 s * 1 GB/s).
In any case, that should be included in the usable storage that I computed earlier.
And, as always, your mileage will vary.
Like (2) · 2 months ago
Mike T., Jose M. like this
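
Ted's sizing rule, sketched; the 1 GB/s write rate and the ~20-minute heaviest write window are his illustrative assumptions:

    def processing_space_tb(write_gb_per_s=1.0, longest_write_minutes=20):
        """Per-node working space sized as write rate x duration of the heaviest write phase."""
        return write_gb_per_s * longest_write_minutes * 60 / 1000

    print(processing_space_tb())   # 1.2 TB, i.e. roughly the "about a TB" per node above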
