
NetApp CPU Bottleneck Issues

Some help when dealing with CPU bottleneck issues

A general strategy for analyzing bottlenecks is to use both service metrics (protocol, volume, and LUN
latency) and component metrics (CPU, disk I/O, network I/O). Together they give a holistic view of the
system and reduce the chance of drawing a false conclusion.
To begin with, though, it makes sense to understand how Data ONTAP makes use of multiple CPUs.

The Data ONTAP operating system implements coarse-grained symmetric multiprocessing (CSMP).

This means that Data ONTAP spreads processes across multiple CPUs and groups these processes into
domains. The key point is that although different domains can run simultaneously on different
processors, each individual domain can run on only a single CPU at any one time. This is useful,
because any domain showing 100% usage indicates a CPU bottleneck for that bundle of related processes.

When you run 'sysstat -M 1' you can see CPU statistics across these domains:

Network
Protocol
Cluster
Storage
Raid
Target
Kahuna
WAFL_Ex(Kahu)

A domain bottleneck is reached when a single domain reaches 100% utilization (for example Network,
Storage, Raid, Target, or Kahuna).
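
The domain rule can be sketched in a few lines of Python. This is only an illustration: the helper name saturated_domains and the sample percentages are hypothetical, and the values are typed in by hand rather than parsed from real sysstat -M output.

```python
# Hypothetical helper: flag domains whose utilization is at or above a threshold.
def saturated_domains(sample, threshold=100):
    """Return the names of domains at or above `threshold` percent."""
    return [name for name, pct in sample.items() if pct >= threshold]

# Illustrative numbers only, not taken from a real filer.
sample = {"Network": 35, "Protocol": 12, "Storage": 100, "Raid": 48, "Target": 0}

print(saturated_domains(sample))
```

Because each domain can run on only one CPU at a time, a domain pinned at 100% is a bottleneck even when the other CPUs are mostly idle.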

High CPU does not always indicate a problem on the filer. For example, on a multiprocessor filer
the output of sysstat -x 1 can be quite deceiving because it does not show the AVG utilization
percentage, which is a truer indicator of system performance.

What is processor utilization?

Processor utilization is simply the percentage of time the processor is busy.

For example, sysstat -x 1 may show a very high percentage,

whereas sysstat -m 1 shows rather normal figures.

USEFUL KBs

Block reclamation scanners cause kahuna bottleneck.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=210480
What is the 'wafl scan status' command?
https://kb.netapp.com/support/index?page=content&id=3011346
How does Data ONTAP make use of multiple CPUs?
https://kb.netapp.com/support/index?page=content&id=3010150
[Apparently this KB: 3010150 is removed from the NetApp Support site]
What causes High CPU during disk scrub although raid.scrub.perf_impact is set to low?
https://kb.netapp.com/support/index?page=content&id=3011323
Data ONTAP 8: sysstat shows high CPU utilization on multiple processor system
https://kb.netapp.com/support/index?page=content&id=2013653
How does Data ONTAP schedule work across multiple physical CPUs?
https://kb.netapp.com/support/index?page=content&id=3010118
[Apparently this KB: 3010150 is removed from the NetApp Support site]
If the filer acts as a SnapMirror destination, it is busy running the deswizzler after a SnapMirror
update, which can cause high CPU usage. What is the deswizzler, or deswizzling?
https://kb.netapp.com/support/index?page=content&actp=LIST&id=3011866
You can monitor the deswizzler's work with the command wafl scan status:
https://kb.netapp.com/support/index?page=content&id=3011346
Diagnosing NetApp CPU Issues Kahuna Bottlenecks
http://dosysadminsdream.wordpress.com/2013/01/24/diagnosing-netapp-cpu-issues-kahunabottlenecks/

Nice to know

FACT: High CPU on a storage controller does not always mean a CPU bottleneck or a performance
problem. In Data ONTAP, high CPU means only that the system is doing a lot of work. If the storage
controller is not busy with user protocol workload, it does background work such as deswizzling or
disk scrubbing. When user workload is introduced, Data ONTAP is able to throttle this scanner work
down in order to dedicate the CPU to the user workload.

FACT: During disk scrubbing, the system checks the disk blocks of all disks for media errors and
parity consistency. If Data ONTAP finds media errors or inconsistencies, it fixes them by
reconstructing the data from other disks and rewriting it, which is why CPU load is high during
that time. To minimise the performance impact, you can schedule the disk scrub for non-peak hours
or set the RAID scrub speed to low:
filer>options raid.scrub.perf_impact low

WAFL SCAN
There are many background WAFL scans for internal filesystem maintenance. As a result, one might
see read/write activity in the sysstat -x 1 output. wafl scan is one of them; it is always on and
is prioritized to run when the filer is idle.
Volume vol0:
Scan id   Type of scan                   progress
213       active bitmap rearrangement    fbn 1513 of 2230 w/ max_chain_len 3

This is normal!

NetApp performance Diagnosis commands

Note: Don't forget to turn on session logging in PuTTY, as the output will often exceed the screen
length. Also note that certain commands may not be available at the admin prompt [priv set admin];
you may have to go to an advanced level such as [priv set advanced] or [priv set diag].
TIP: If you are not confident about running these commands on a production filer, keep a
simulator running by your side. You can run the commands on the simulator first and build up your
confidence before going about your business.
This command gives you overall stats per second. [You can change the interval by providing a
different value such as 2, 3, 5, or 6, for example sysstat -x 5.]

filer>sysstat -x 1

Gives you a second-by-second readout of the filer's performance. In particular, look at the CP Time
and CP Type columns. If you are constantly hitting 100% CP Time and the CP Type is showing lots of
Bs (back-to-back CPs), the NVRAM cache is being flooded and the filer is struggling to write all
the incoming data quickly enough. A lowercase b marks deferred back-to-back CPs (CP generated CP),
which probably indicates that the condition is getting worse.
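
As a rough way to quantify "lots of Bs", the sketch below counts how many samples report a back-to-back CP. It assumes you have already copied the CP Type values into a list by hand; the function name back_to_back_ratio and the sample values are hypothetical, and treating a lowercase b as the deferred variant is an assumption about the column format.

```python
# Hypothetical helper: share of samples whose CP type starts with 'B' or 'b'.
def back_to_back_ratio(cp_types):
    """Return the fraction (0.0 to 1.0) of samples showing back-to-back CPs."""
    if not cp_types:
        return 0.0
    hits = sum(1 for t in cp_types if t[:1] in ("B", "b"))
    return hits / len(cp_types)

# Illustrative values only; one entry per sysstat -x 1 line.
samples = ["T", "B", "B", "b", ":", "B", "F", "B"]

print(back_to_back_ratio(samples))
```

A ratio that stays high across a long capture is a better signal than a single burst of back-to-back CPs.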

filer>priv set diag

filer>statit -b

Wait about 5 seconds, then run:

filer>statit -e

This pair of commands gives detailed stats of filer disk performance. The first begins (-b) the
performance snapshot and the second ends (-e) it. The output can indicate which disks are being
hammered.
You may also refer to the following PDFs:
Monitoring Storage Performance using NetApp Operations Manager
http://media.netapp.com/documents/tr-4090.pdf
NetApp Storage Monitoring Using HP OpenView
http://www.netapp.com/us/media/tr-3688.pdf

Average CPU HIGH Bottleneck


To check how all the CPUs are doing:

filer>priv set diag

filer>sysstat -m 1

sysstat -m displays per-processor and average utilization.

The ANY column in the sysstat -m output shows the percentage of time that one or more CPUs were
busy. The utilization of each individual processor is also displayed, as well as the average (AVG).
As long as average CPU is not at 100%, there is nothing to worry about. NetApp OnCommand
Performance Advisor might show CPU as high as 100% consistently, but do not panic: it is just
plotting the percentage of time that one or more CPUs were busy.
In this example, AVG CPU is quite normal.

Only if you see AVG CPU at 100% consistently do you need to be concerned. In that case, talk to
NetApp and check whether you are hitting a bug.
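
To see why ANY can sit at 100% while AVG stays low, consider this small Python model. It is an illustration only, not how Data ONTAP computes its counters: the function name any_and_avg and the busy/idle grid are hypothetical, with each row standing for one time slice and each column for one CPU.

```python
# Hypothetical model: each row is a time slice, each column is one CPU
# (True = busy, False = idle).
def any_and_avg(busy):
    """Return (ANY %, AVG %, per-CPU % list) for a grid of busy flags."""
    slices = len(busy)
    cpus = len(busy[0])
    # ANY: share of slices in which at least one CPU was busy.
    any_pct = 100 * sum(1 for s in busy if any(s)) / slices
    # Per-CPU: share of slices in which that particular CPU was busy.
    per_cpu = [100 * sum(1 for s in busy if s[c]) / slices for c in range(cpus)]
    avg_pct = sum(per_cpu) / cpus
    return any_pct, avg_pct, per_cpu

# Four CPUs take turns: exactly one is busy in every slice.
busy = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, True],
]

print(any_and_avg(busy))
```

In this model some CPU is always busy, so ANY is 100%, yet each CPU is busy only a quarter of the time and AVG is 25%.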

Kahuna bottleneck

A Kahuna bottleneck occurs when the sum of the Kahuna domain and the (Kahu) value from the WAFL_Ex
domain reaches 100% utilization.
To check how all the CPUs are doing across all domains:

filer>sysstat -M 1

In the example below, I have circled the Kahuna domain and squared (Kahu) to make it clear.

In this example, Kahuna + (Kahu) adds up to 95% and 96%, which is quite high but not yet at the
100% mark.

IMP: Kahuna processes and (Kahu) processes cannot run simultaneously, so a potential Kahuna
bottleneck occurs when the Kahuna value and the (Kahu) value add up to 100%.

It is important to keep a watch on this domain percentage; it is a matter of concern if it
consistently remains at 100% for days together. In most cases, it returns to normal within a few
hours, so do not panic.
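
The Kahuna check, including the "consistently" part, can be sketched as follows. The function names and the split between Kahuna and (Kahu) in the sample pairs are hypothetical; only the totals of 95 and 96 come from the example above.

```python
# Hypothetical helpers for the Kahuna + (Kahu) rule.
def kahuna_pressure(kahuna_pct, kahu_pct):
    """Return (combined %, True if the combined value has reached 100%)."""
    total = kahuna_pct + kahu_pct
    return total, total >= 100

def sustained(samples, min_consecutive):
    """True if at least `min_consecutive` consecutive samples reach 100%."""
    run = 0
    for kahuna, kahu in samples:
        run = run + 1 if kahuna + kahu >= 100 else 0
        if run >= min_consecutive:
            return True
    return False

# (Kahuna, Kahu) pairs; the individual values are made up for illustration.
samples = [(60, 35), (61, 35), (70, 30), (65, 35)]

print(kahuna_pressure(60, 35))
print(sustained(samples, 2))
```

Choose min_consecutive to match your sampling interval, so that a short spike is not mistaken for a bottleneck that has lasted for hours or days.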

Reach Out to NetApp Support

If you are unable to make sense of all this, do not worry: contact NetApp technical support by
phone or email; they are really good. In most cases, they will ask you to collect logs and upload
them to the NetApp support site.
To help you do this, NetApp support will direct you to the following tools for log collection:

Tool: Perfstat
C:\>perfstat -f [filer] -t 5 -i 6 > [case number].perfstat.out
Download the Perfstat tool from the NetApp Support Site:
http://support.netapp.com/NOW/download/tools/perfstat/

Tool: NSanity
Collects details of all SAN related components for end-to-end diagnosis.
For full command info check the NSanity page on the NOW site.
http://support.netapp.com/NOW/download/tools/nsanity/

How to upload a file to NetApp
https://kb.netapp.com/support/index?page=content&id=1010090

BUGs that are linked to HIGH CPU Utilization

IMPORTANT TIP: Whenever you open a bug page on the NetApp Support site, always follow the link at
the bottom of the 'Fixed-In Version' section, titled 'A complete list of releases where this bug
is fixed is available here'. This is because the Fixed-In Version section may not contain the
complete list of Data ONTAP versions in which the bug is fixed.

As shown in the figure below:

BUG: 698798: High CPU utilization with many concurrent 'block ownership' and 'blocks used'
scanners
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=648017
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=698798
[Note: BUG 648017 is fixed from release 8.1.2P3 onwards, which indicates that the bug is present
in 8.1.2. That said, it does not mean that you are hitting this bug.]

BUG: 91653: Volume SnapMirror source has high CPU usage
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=91653

BUG: 110630: Wildcard searches from CIFS on large directories are CPU-intensive
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=110630

C-MODE BUG: 595957: High CPU utilization on Cluster-Mode storage systems that have high
number of SAS shelves and disks
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=595957

BUG: 590193: WAFL background file system scanner may cause high CPU usage.
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193

BUG: 164124: Kerberos replay cache can cause high CPU usage
http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=164124

Courtesy: NetApp

ashwinwriter@gmail.com
Jan, 2014
