A general strategy for analyzing the bottlenecks is to use both service metrics (protocol/volume/lun
latency) and component metrics (CPU, Disk IO, Network IO) to provide a holistic view of the system
and reduce the chance of making a false conclusion.
But to begin with, it helps to understand how Data ONTAP makes use of multiple CPUs.
When you run 'sysstat -M 1' you can see CPU statistics across these domains:
Network
Protocol
Cluster
Storage
Raid
Target
Kahuna
WAFL_Ex(Kahu)
A domain bottleneck is reached when a single domain reaches 100% utilization (for example Network,
Storage, Raid, Target, or Kahuna).
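The rule above can be sketched in Python as a check over one already-parsed sample. The dictionary
layout, the function name, and the numbers are assumptions made for illustration; they are not real
sysstat output, and parsing the command output is left out.

```python
# Hypothetical sketch: flag domains that have hit the single-domain limit.
# `sample` maps a domain name to its utilization percentage for one interval.
SATURATION_PCT = 100

def saturated_domains(sample, threshold=SATURATION_PCT):
    """Return the names of domains at or above the threshold, sorted."""
    return sorted(name for name, pct in sample.items() if pct >= threshold)

# Made-up numbers, not taken from a real filer.
sample = {"Network": 35, "Storage": 100, "Raid": 12, "Target": 0, "Kahuna": 60}
print(saturated_domains(sample))
```

With these made-up values only the Storage domain is reported, even though the other domains are
far from busy.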
High CPU does not always indicate a problem on the filer. For example, on a multi-processor filer
the output of 'sysstat -x 1' can be quite deceiving because it does not show the average (AVG)
utilization percentage, which is a truer indicator of system performance.
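A small Python sketch shows why a single high figure and the average can differ widely on a
multi-processor system. The per-CPU percentages and the function name are made up for illustration
and are not taken from a real filer.

```python
# Hypothetical sketch: compare the busiest CPU with the average across CPUs.
def summarize_cpus(per_cpu_pct):
    """Return (busiest, average) utilization for a list of per-CPU percentages."""
    if not per_cpu_pct:
        raise ValueError("need at least one CPU reading")
    return max(per_cpu_pct), sum(per_cpu_pct) / len(per_cpu_pct)

# Made-up readings for a four-CPU system.
busiest, average = summarize_cpus([98, 20, 15, 27])
print(f"busiest CPU: {busiest}%  average: {average:.0f}%")
```

Here one CPU is nearly saturated while the average across all four is only 40%.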
USEFUL KBs
Nice to know
FACT:
A high CPU on a storage controller does not always mean a CPU bottleneck or a performance
problem. In Data ONTAP, a high CPU means only that it is doing a lot of work. If the storage
controller is not busy with user protocol workload, it is doing background work such as deswizzling
or disk scrubbing. When user workload is introduced into the system, Data ONTAP is able to throttle
this scanner work down in order to dedicate the CPU to the user workload.
FACT: During disk scrubbing, the system checks the disk blocks of all disks for media errors
and parity consistency. If Data ONTAP finds media errors or inconsistencies, it fixes them by
reconstructing the data from other disks and rewriting it, which is why the CPU load is high at
that time. To minimise the performance impact, you can schedule the disk scrub for non-peak hours
or change the RAID scrub speed to low by using:
filer>options raid.scrub.perf_impact low
WAFL SCAN
There are many background WAFL scans for internal filesystem maintenance. As a result, one might
see read/write activity in the 'sysstat -x 1' command output. WAFL scan is one such activity; it is
always on and is prioritized to run when the filer is idle.
Volume vol0:
 Scan id    Type of scan    progress
This is normal!
Note: Don't forget to turn on session logging in the PuTTY session, as the output will often exceed
the screen length. Also note that certain commands may not be available at the admin prompt
[priv set admin]; you may have to go to a higher privilege level such as [priv set advanced] or
[priv set diag].
TIP: If you are not sure or confident about running these commands on the production filer, keep
a SIMULATOR running by your side. This way, you can run the commands on the SIMULATOR first and
build up your confidence before running them in production.
This command gives you overall stats per second. [You can change the interval by providing a
different value such as 2, 3, 5, 6, etc.; for example, sysstat -x 5]
filer>sysstat -x 1
Gives you a second-by-second readout of the filer's performance. In particular, look at the CP Time
and CP Type columns. If you are constantly hitting 100% CP Time and the CP Type is showing lots of
Bs (back-to-back CPs), this indicates that the NVRAM cache is being flooded and the filer is
struggling to write all the incoming data quickly enough. A related CP type, deferred back-to-back
CPs (shown as a lowercase b; CP generated CP), probably indicates that the condition is getting
worse.
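As a rough illustration, the check described above can be written as a Python sketch over samples
that have already been parsed. The (CP time, CP type) tuple layout, the function name, and the
sample values are assumptions for illustration. Treating a lowercase b as the marker for deferred
back-to-back CPs is likewise an assumption here.

```python
# Hypothetical sketch: how often do samples show 100% CP time with a
# back-to-back CP type? Parsing of sysstat output is deliberately left out.
def back_to_back_ratio(samples):
    """samples: list of (cp_time_pct, cp_type) tuples, one per interval."""
    if not samples:
        return 0.0
    hits = sum(
        1
        for cp_time, cp_type in samples
        # "B" = back-to-back CP; "b" = deferred back-to-back CP (assumed marker)
        if cp_time >= 100 and cp_type in ("B", "b")
    )
    return hits / len(samples)

# Made-up samples, not real filer output.
samples = [(100, "B"), (100, "B"), (45, "T"), (100, "b")]
print(back_to_back_ratio(samples))
```

A ratio that stays close to 1.0 over many intervals matches the "constantly hitting" condition;
an occasional B on its own does not.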
filer>statit -b
filer>statit -e
This command gives detailed stats of filer disk performance. The first begins (-b) the performance
snapshot and the second ends (-e) it. The output can indicate which disks are being hammered.
You may also refer to the following PDF [Monitoring Storage Performance using NetApp Operations
Manager]:
http://media.netapp.com/documents/tr-4090.pdf
filer>sysstat -m 1
Only if you see the AVG CPU percentage at 100% consistently do you need to be concerned. In that
case, talk to NetApp and check whether you are hitting a known bug.
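One way to make "consistently" concrete is to require that every sample in a recent window reaches
the threshold. The Python sketch below assumes a list of already-collected AVG CPU percentages. The
window size of 5 and the function name are arbitrary choices for illustration, not NetApp guidance.

```python
# Hypothetical sketch: is the average CPU pinned at the threshold for a
# whole window of recent samples, rather than for a single spike?
def consistently_saturated(avg_pct_samples, window=5, threshold=100):
    """True if the last `window` samples all reach the threshold."""
    if len(avg_pct_samples) < window:
        return False  # not enough data to call it consistent
    return all(pct >= threshold for pct in avg_pct_samples[-window:])

# Made-up readings: a brief dip breaks the run, a full window does not.
print(consistently_saturated([100, 100, 97, 100, 100]))
print(consistently_saturated([80, 100, 100, 100, 100, 100]))
```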
Kahuna bottleneck
A Kahuna bottleneck is reached when the sum of the Kahuna domain and the (Kahu) value from the
WAFL_Ex domain reaches 100% utilization.
To check how all the CPUs are doing across all domains:
filer>sysstat -M 1
In the example below, I have circled the 'Kahuna' domain and squared '(Kahu)' to make it clear.
In this example, Kahuna domain + (Kahu) adds up to 95% and 96%, which is quite high but has not
reached the 100% mark yet.
IMP: Kahuna processes and (Kahu) processes cannot run simultaneously, so a potential Kahuna
bottleneck occurs when the Kahuna value and the (Kahu) value add up to 100%.
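The arithmetic can be sketched in Python as follows. The function names are hypothetical, and the
60/35 split is made up so that it sums to 95; the individual values are not given in the text.

```python
# Hypothetical sketch: Kahuna and (Kahu) cannot run simultaneously, so their
# percentages are added before comparing against the 100% limit.
def kahuna_load(kahuna_pct, kahu_pct):
    """Combined utilization of the Kahuna domain and (Kahu) from WAFL_Ex."""
    return kahuna_pct + kahu_pct

def is_kahuna_bottleneck(kahuna_pct, kahu_pct, limit=100):
    """True when the combined value reaches the limit."""
    return kahuna_load(kahuna_pct, kahu_pct) >= limit

# Made-up split that sums to 95.
print(kahuna_load(60, 35), is_kahuna_bottleneck(60, 35))
```

With these values the combined load is 95, which is high but not yet a bottleneck under this rule.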
If you are unable to make sense of all this, do not worry; just contact NetApp technical support by
phone or email. They are really good. In most cases, they will ask you to collect the logs and
upload them to the NetApp support site.
To help you do this, NetApp support will direct you to the following tools for log collection:
Tool : Perfstat
C:\>perfstat -f [filer] -t 5 -i 6 > [case number].perfstat.out
Download the perfstat tool from the NetApp Support Site:
http://support.netapp.com/NOW/download/tools/perfstat/
Tool: NSanity
Collects details of all SAN related components for end-to-end diagnosis.
For full command info check the NSanity page on the NOW site.
http://support.netapp.com/NOW/download/tools/nsanity/
IMPORTANT TIP: Whenever you open a bug page on the NetApp Support site, always follow the link at
the bottom of the 'Fixed-In Version' section, titled "A complete list of releases where this bug
is fixed is available here". This is because the Fixed-In Version section may not contain the
complete list of Data ONTAP versions in which the bug is fixed.
BUG: 698798: High CPU utilization with many concurrent 'block ownership' and 'blocks used'
scanners
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=648017
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=698798
[Note: BUG 648017 is fixed from release 8.1.2P3 onwards, which indicates that the bug is present
in 8.1.2. Having said that, it does not mean that you are hitting this bug.]
C-MODE BUG: 595957: High CPU utilization on Cluster-Mode storage systems that have high
number of SAS shelves and disks
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=595957
BUG: 590193: WAFL background file system scanner may cause high CPU usage.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193
Courtesy: NetApp
ashwinwriter@gmail.com
Jan, 2014