Disk path design for AIX including SAN zoning

The purpose of this document is to describe how to design the SAN setup to keep the
number of disk paths to a reasonable level. Other factors include the use of VIO, NPIV and
dual SAN fabrics. Setting up the storage involves the storage, SAN and AIX administrators,
so we'll look at it from those perspectives. The article will examine how to determine the
number of possible paths based on the SAN cabling, cover the concepts of SAN zoning, LUN
masking, host port groups and how these affect disk paths, and look at disk paths with VIO
using VSCSI or NPIV. The general case includes some number of available host HBA ports
and some number of storage ports, and while this article doesn't cover the general case, the
author hopes that by going thru a few specific examples, you'll have the skills to handle
your situation.

Overview

The basic approach taken here is to group the sets of host and storage ports into a number
of subsets, and then, using SAN zoning and/or LUN masking, connect each subset of host
ports to a subset of storage ports to keep the number of paths to a reasonable level. The
full set of LUNs for a server is then split into the same number of subsets, and the LUNs in
each subset are assigned to use one subset of host and storage ports. This description may
not be very clear on its own, so to fully understand the approach we'll first go over the
concepts of LUN assignment, SAN zoning, LUN masking and disk paths on which it depends.

In the setup of SAN storage, LUNs are created on a disk subsystem and then assigned to a
group of world wide names (WWNs) representing fibre channel (FC) ports on a host (or a
group of hosts in clustered applications). WWNs come in two forms: the world-wide port
name (WWPN), which refers to a specific port on a system, and the world-wide node name
(WWNN), which can refer to the entire system and all of its ports. Most storage systems will
only work with a host's WWPNs when configuring host access to a LUN. A WWN is a 16 digit
hexadecimal number that is typically burned into the hardware by the manufacturer. The
group of WWNs is called various things depending on the disk subsystem: on the DS5000 it
is called a "storage partition," "host" or "host group," the SVC calls it a "host object," the
XIV calls it a "host" or "cluster," and the DS8000 calls it a "port group." For this article,
we'll use the phrase "host port group." It's also worth knowing that when purchasing some
disk subsystems, such as the DS5000, one must specify the number of host port groups to
which the disk subsystem will be connected.

The terminology for assigning or mapping a LUN to a host port group also varies across disk
subsystems, and, depending on the disk subsystem, you may be able to control which
storage ports are used to handle IO for a LUN; this is called LUN masking (or port masking).
The storage might have additional terminology which can also lead to confusion. For
example, on the DS8000 a group of LUNs which are assigned as a group is called a "volume
group," which shouldn't be confused with an LVM volume group.

The AIX administrator can display the WWN for an adapter with the # lscfg -vl <fcs#>
command. SAN switch administrators can also see the WWNs attached to the switch:

# lscfg -vl fcs0


fcs0 U789D.001.DQD51D7-P1-C1-T1 4Gb FC PCI Express Adapter (df1000fe)

Alternatively one may use the fcstat command:

# fcstat fcs0

FIBRE CHANNEL STATISTICS REPORT: fcs0

Device Type: FC Adapter (adapter/pciex/df1000fe)


Serial Number: 1F8240C89D
Option ROM Version: 02E8277F
Firmware Version: Z1D2.70A5
World Wide Node Name: 0x20000000C97710F3
World Wide Port Name: 0x10000000C97710F3

....
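
When a server has several FC adapters, a quick way to collect all of the WWPNs to hand to
the SAN and storage administrators is to loop over the fcs devices. A minimal sketch (the
adapter names will of course vary by system):

# for f in $(lsdev -Cc adapter -F name | grep '^fcs'); do echo "$f $(fcstat $f | grep 'World Wide Port Name')"; done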

To understand the number of paths to a LUN, consider Figure 1, which shows a server with
two fibre channel (FC) ports and three FC ports on the storage, connected via a single FC
switch. In this setup, each LUN can have up to 6 paths: the number of host ports times the
number of storage ports.
Figure 1
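
Once the LUNs have been configured on AIX, the actual number of paths the operating
system sees can be checked with the lspath command, which prints one line per path (path
status, hdisk name and parent fscsi or vscsi adapter). A minimal sketch, where hdisk4 is
just an example disk name:

# lspath -l hdisk4              # one line per path: status, disk, parent adapter
# lspath -l hdisk4 | wc -l      # total number of paths to hdisk4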

Many disk subsystems also allow you to specify which disk subsystem ports will be used to
handle IOs for a LUN, which is known as LUN masking. And we can zone the SAN so that
the number of paths is further reduced.

There are trade offs here. It'd be simpler to just use all ports on both the storage and
server, and with a load balancing algorithm on the host, we'd balance use of all resources
here: server ports, SAN links, and storage ports. However, with too many paths, overhead
of path selection can reduce performance.

It's worth pointing out that the algorithm and path control module (PCM) used for path
selection will affect the overhead of path selection. For example, with an
algorithm=fail_over with MPIO, the path is fixed so no cycles are used for path selection.
Load balancing algorithms vary in how they choose the best path. They might choose the
path with the fewest number of outstanding IOs, they might examine IO service times and
choose the path providing the best IO service times, or may just choose a path at random.
And often the method isn't documented.
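
For MPIO disks using the default AIX PCM, the algorithm is a disk attribute that can be
displayed and changed with lsattr and chdev. A sketch, assuming hdisk4 is an MPIO disk
that is not currently open (otherwise use chdev with the -P flag and reboot):

# lsattr -El hdisk4 -a algorithm        # algorithm currently in use
# lsattr -Rl hdisk4 -a algorithm        # algorithms this PCM supports
# chdev -l hdisk4 -a algorithm=round_robin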

SAN Zoning

A SAN zone is a collection of WWNs which can communicate with each other. In figure 1,
there are 5 WWNs: two for the host ports and three for the storage ports. There are
initiators (host FC ports) and targets (storage ports), with the terminology arising as
initiators initiate an IO, and the request is sent to a target. There are various kinds of
zoning: soft and hard, and WWN and port zoning (e.g. see
http://en.wikipedia.org/wiki/Fibre_Channel_zoning), and the subject is much broader
than will be covered here. However, we do need to understand some basic SAN zoning to
understand how it, and LUN masking, are related to disk paths. WWN zoning is relatively
popular and so we'll examine it first.

There is a difference between WWPN and WWN; however, they are often used
interchangeably, which can lead to some confusion. Zoning best practices indicate that
WWPNs should always be used when implementing the WWN zoning method.

The wikipedia article states that "With WWN zoning, when a device is unplugged from a
switch port and plugged into a different port (perhaps on a different switch) it still has
access to the zone, because the switches check only a device's WWN - i.e. the specific port
that a device connects to is ignored." A dual port host FC adapter will have one node name
(WWNN) but two WWPNs, so this flexibility is a benefit of WWN zoning. It's also worth
noting that to take advantage of this capability on AIX, the fscsi device attribute dyntrk
must be set to yes (the default is no). Best practice for WWN zoning is to use "single
initiator zoning," where an initiator is an FC port on the host; such a zone can include
multiple targets. A major reason for this is that as initiators are added to a zone (e.g. when
a host is booted), the initiator logs into the fabric (a PLOGI) and this causes a slight delay
for other ports in the zone, and the delay is longer with larger zones. Another benefit of
single initiator zoning is that it minimizes issues that may result from a faulty initiator
affecting others in the zone.
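
On AIX, dyntrk is an attribute of the fscsi protocol device, along with fc_err_recov, which is
commonly set to fast_fail in multipath configurations. A sketch of checking and changing
them; the -P flag defers the change to the next boot, since the fscsi device is normally busy:

# lsattr -El fscsi0 -a dyntrk -a fc_err_recov
# chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P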

So best practice for Figure 1 would be to have two zones: one host port and typically the
three storage ports in one zone, and the other host port and three storage ports in the other
zone:

Zone 1: host port 1, storage port 1, storage port 2, storage port 3


Zone 2: host port 2, storage port 1, storage port 2, storage port 3
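
As an illustration, the two zones above could be created with WWN zoning on a Brocade
switch roughly as follows. The host WWPN is the one from the fcstat output shown earlier;
the storage WWPNs, aliases, zone and config names are made-up placeholders, and the
alicreate commands for the remaining ports are omitted:

alicreate "host_p1", "10:00:00:00:c9:77:10:f3"
alicreate "stor_p1", "50:05:07:63:aa:bb:cc:01"
zonecreate "z_host_p1", "host_p1; stor_p1; stor_p2; stor_p3"
zonecreate "z_host_p2", "host_p2; stor_p1; stor_p2; stor_p3"
cfgcreate "prod_cfg", "z_host_p1; z_host_p2"
cfgsave
cfgenable "prod_cfg"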

Note that some disk subsystems have restrictions on zoning; e.g., on the DS4000 with
RDAC, all host ports should not be zoned to all storage ports.

Figure 2 shows the difference between WWN zoning and switch port zoning. The shaded
area in the figure represents a single initiator zone, allowing communication to two storage
ports. There would be similar zones for each host port: host port 2 communicating with
storage ports 1 and 2, and host ports 3 and 4 communicating with storage ports 3 and 4.
The switch port zones allow host ports 1 and 2 to communicate with storage ports 1 and 2,
and similarly host ports 3 and 4 with storage ports 3 and 4. So we've achieved the same
connections from the server to the storage, but using different zoning approaches. WWN
zoning offers the flexibility to move cables from one port on a switch to another, while
switch port zones do not. It is possible to have a single switch port in multiple port zones,
so that offers some flexibility. With NPIV, where vFC adapters move around during a live
partition migration (LPM), WWN zoning also offers the ability for the zone to follow the vFC
adapter.
Figure 2

Customers often implement dual SAN fabrics. This has the benefit that if there's a problem
with one fabric, hosts will still have access to storage via the other fabric. Figure 3 shows a
common dual SAN setup:
Figure 3

Note that in this example, we could have up to 8 paths to a LUN, 4 thru each SAN fabric.
Also, while each SAN has just one switch, it's possible to have more. This is also an example
of a belt and suspenders setup: a single connection from the host to each switch, plus a
single connection from the storage to each switch, would already provide full redundancy,
and we'd only have two paths for each LUN in that case. However, with that minimal setup,
if one SAN fabric fails and the wrong remaining link, port or adapter also fails, the host can
lose access to the storage. SAN zoning here would comprise 4 zones, one for each host
port, with just the 2 storage ports on that fabric in each zone.

Note that if the customer is using dual port adapters in the host, the best practice would be
to attach one port of an adapter to one fabric and the other port to the other fabric. This
enhances availability in that if both an adapter and SAN fabric fail, we won't lose access to
the storage.

Disk Paths with VIO and VSCSI LUNs

There are two layers of multi-path code in this case: the multi-path code on the VIO server
(VIOS) which will be what the storage supports (often there's a choice, e.g. SDD or
SDDPCM for DS8000), and the multi-path code on the VIO client (VIOC) which will be MPIO
using the SCSI PCM that comes with AIX. On the VIOS, the multi-path code is used to
choose among the FC paths to the disk, while on the VIOC it's used to choose among paths
across dual VIOSs. The fact that there are two layers of multi-path code often leads to some
confusion.

In a dual VIOS setup, LUNs will typically have multiple paths to each VIOS, and the VIOC
can use paths thru both VIOSs. It's worth knowing that all IO for a single LUN on a VIOC
with VSCSI will go thru one VIOS (except in the case of a VIOS failure or the failure of all its
paths to the disk). One can set a path priority so that IOs for half the LUNs will use one
VIOS, and half will use the other. This is recommended to balance use of the VIOS
resources (including their HBAs) and get the full bandwidth available. Figure 4 shows a dual
VIOS setup.
Figure 4
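
On the VIOC, the preferred VIOS for a given LUN is controlled thru the path priority
attribute, which applies with the default fail_over algorithm; the path with the lowest
priority value is preferred. A sketch, where hdisk0 is an example client disk with paths thru
vscsi0 and vscsi1, and we make vscsi0 the preferred path (for the other half of the LUNs,
demote the vscsi0 path instead):

# lspath -l hdisk0 -F "status parent connection"   # confirm the vscsi parents of each path
# lspath -AHE -l hdisk0 -p vscsi0                  # display path attributes, including priority
# chpath -l hdisk0 -p vscsi1 -a priority=2         # demote the vscsi1 path; vscsi0 stays preferred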

The storage and SAN administrators will zone LUNs to adapters on both VIOSs, and the AIX
administrator will map LUNs from each VIOS to the VIOC. For the example in Figure 3, each
VIOS potentially has 4 paths to each LUN. The VIOC, however, sees only two paths for its
hdisks: the paths thru the two VIOSs. Using single initiator zoning, there would be 4 zones.
So this case essentially reduces to that of a server (here the VIOS) with 2 HBAs, and we
need only concern ourselves with disk path design at the VIOS, since the VIOC will always
have exactly two paths for each LUN in a dual VIOS setup.

Disk Paths with VIO and NPIV LUNs

With NPIV, only one layer of multi-path code exists, and that's at the VIOC. The multi-path
code used will be specified by the storage vendor. One advantage of this is that IOs for each
LUN can be balanced across VIOSs, and one doesn't need to set path priorities. With NPIV in
figure 3, the AIX/VIO administrator would typically create 4 virtual FC (vFC) adapters. So
each LUN potentially has 8 paths. Zoning would be the same as in the VSCSI case (4 zones,
one for each vFC).

One difference between NPIV and VSCSI is that when creating a vFC adapter, two WWNs
are created to support Live Partition Migration (LPM), so the storage administrator assigns
the LUN to both WWNs. As the partition moves among systems, the two WWNs are used
alternately. Both WWNs for the vFC can be used in a single zone if the number of zones
needs to be limited.
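
To see which WWPNs a vFC client adapter is using and how it is mapped, one can look at
both sides; the pair of WWPNs assigned to the client adapter is also visible in the adapter's
properties on the HMC. A sketch (device names are examples):

$ lsmap -all -npiv        (on the VIOS, as padmin: shows each vfchost, its client partition and the backing physical fcs port)
# fcstat fcs0             (on the VIOC: the World Wide Port Name reported is the WWPN the vFC is currently using)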

Disk Path Limits and Recommendations


The MPIO architecture doesn't have a practical limit on the number of paths (i.e. you can
have far more paths than you need), though there are limits with SDDPCM, which uses
MPIO: the latest SDDPCM manual states that SDDPCM supports a maximum of 16 paths per
device. More paths also require more memory and can affect the ability of the system to
boot, so there is a limit there (see
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.baseadmn/doc/baseadmndita/devconfiglots.htm),
but this applies to configurations with so many devices that managing them is likely to be
impractical, and most customers will have far fewer than the maximum.

As noted above, SDDPCM does have a limit of 16 paths per LUN. The SDDPCM manual also
states that "with the round robin or load balance path selection algorithms, configuring
more than four paths per device might impact the I/O performance. Use the minimum
number of paths necessary to achieve sufficient redundancy in the SAN environment." So
the downside of more paths is that CPU and memory overhead to handle path selection will
increase, and this will also have a very slight impact on IO latency; however, at this time
the author has been unable to find data showing the overhead. Error recovery time (for
failed paths, host ports or storage ports) may be a bigger factor with more paths, but data
is lacking to make specific statements on overhead at the time this document was written.
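
Where SDDPCM is installed, the pcmpath command provides a quick check that the zoning
and LUN masking produced the intended number of paths per device and that IO is being
spread across them:

# pcmpath query device        # per-device path list with state, mode and select counts
# pcmpath query adapter       # per-adapter totals, useful for spotting unbalanced port usage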

The path selection algorithm also affects the overhead. Algorithms that load balance across
paths might do so based on the number of outstanding IOs on the path, or perhaps the
average IO service times for IOs on a path. For example, SDDPCM offers the load_balance
and load_balance_port algorithms. These load balancing approaches require capturing IO
data and calculating the best path; consequently, they require more overhead than a
round_robin algorithm, which just keeps track of the last path used, or a fail_over
algorithm, which uses the available path with the highest priority.

Bandwidth Sizing Considerations

Bandwidth sizing considerations include the number of host adapters, storage adapters, SAN
links and SAN link speeds (currently 1, 2, 4, 8 or 10 Gbps). Host and storage adapters have
limits on the number of IOPS they can perform for small IOs, and also a limit on the thruput
in MB/s for large IOs. The adapter IOPS limits depend on the processors on the host/storage
adapters (assuming the IOPS bandwidth isn't limited by the number of disk spindles, or
IOPS to the disk subsystem cache). Typically the limit for thruput is gated by the links. For
example, a 4 Gbps link is capable of approximately 400 MB/s of simplex (in one direction)
thruput, and 800 MB/s of duplex thruput, as there is both a transmit and receive fibre in the
cable. However, the host or storage adapter may not be able to handle this thruput. So the
sizing is done based on the expected IO workload: either using IOPS for small block IO or
MB/s for large block IO.
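
As a rough illustration of thruput-based sizing (the workload number here is an assumption,
not from the article), dividing the expected peak large-block thruput by the per-link rate
gives a starting point for the number of links, before considering adapter limits and
redundancy:

# REQ_MBS=1500        # assumed peak large-block workload, MB/s
# LINK_MBS=400        # approximate simplex thruput of one 4 Gbps link
# echo $(( (REQ_MBS + LINK_MBS - 1) / LINK_MBS ))
4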

Often the IOPS bandwidth of the storage adapters differs from that of the host adapters,
and in such a case we might not have the same number of host ports and storage ports.
For example, if the host adapters can perform 3X the IOPS of the storage adapters, a
balanced design would have 3X as many storage adapters and ports as host adapters and
ports. Or if we size for large block IO and thruput, we might have host adapters operating
at 4 Gbps to a switch and storage operating at 2 Gbps. In such a case we'd have twice as
many links from the storage to the switch as from the host to the switch.

Often, the minimum number of ports is sufficient for many workloads, but with VIO and
many VIOCs, or very high IOPS or MB/s workloads, sizing is important to avoid performance
problems. Alternatively, one can implement a solution, determine if more bandwidth is
needed, and add it then.

Note that while bandwidth is related to the number of paths for a single LUN, they are
different things. Adding more bandwidth only improves performance if there are bottlenecks
or heavily utilized components in the solution.

It's also worth considering the question of how many paths are best. Two physically
independent paths are sufficient for availability; thus, one would have two host adapters,
two storage adapters, and links to two physical SAN switches. Beyond this, we only need
more physical resources for additional bandwidth, and those resources add more potential
paths. So the approach to take is to configure sufficient resources for the bandwidth
needed, and then use SAN zoning and LUN masking to keep the number of paths to a
reasonable level.

Disk Subsystem Considerations

Every disk subsystem has architectural considerations that affect disk path design. These
include considerations involving availability and utilization of various disk subsystem
resources.

Some disk subsystems, such as DS5000 and the SVC, have an active/passive controller
design where each LUN is handled by one disk subsystem controller, and the passive
controller is only used in the event the active controller (or all paths to it) fail. Typically the
storage administrator will assign half the LUNs to each controller to balance use of the
controllers. So the path selection algorithms will normally only use paths to the primary
controller; thus, typically only half the paths for each LUN will be used.

On the DS8000, there are up to 4 host ports per adapter, and groups of 4 adapters reside in
a host bay that might be taken offline for maintenance. So one would want to have paths to
different adapters, as opposed to paths to different ports on the same adapter for
availability. Similarly, we'd want paths to adapters in different host bays in case one needs
to be taken offline.

On the SVC, there are up to 8 nodes with 2 nodes per IO group. Each IO group has its own
cache, fibre channel ports and processors, and IOs for a LUN are handled by a single node.
To fully utilize all the SVC resources, one may want the storage administrator to balance the
LUNs for a host across all the available nodes and IO groups.

To balance the use of disk subsystem resources, one needs a certain number of LUNs. To
balance use across, say two storage controllers, we need at least 2 LUNs and preferably an
even number. To balance use across the 4 host bays on the DS8000, we'd need at least 4
LUNs and preferably a multiple of 4. To fully utilize all SVC resources for an 8 node SVC,
one would need a minimum of 8 LUNs and preferably a multiple of 8. And laying out the
data so that IOs are balanced across LUNs will ensure that IOs are balanced across the
resources.

Methodology

The approach to reduce the number of potential paths to a reasonable level is to create
subsets of both the storage and host ports, and to connect these subsets together. We'll
examine this by going thru a few examples.

Example 1: 8 host ports and 8 storage ports with a single SAN fabric

Figure 5

In this example, there are potentially 64 paths for each LUN, which is more than is
recommended. So a simple solution is to create two groups, or subsets, of four ports on the
host (host ports 1-4 and host ports 5-8) and two groups of four ports on the storage
(storage ports 1-4 and storage ports 5-8), as represented by the blue and red links
connected to the ports in Figure 5.

Using single initiator zoning, the first SAN zone would have host port 1 zoned to storage
ports 1-4, then host port 2 zoned to storage ports 1-4, ..., host port 5 to storage ports 5-8,
etc. Assuming the storage doesn't offer LUN masking, then the storage administrator would
create two host port groups (the first containing host ports 1-4) and assign half the LUNs to
each group. If the storage offers LUN masking, then a single host port group could be used
(with host ports 1-8) and the storage administrator would assign half the LUNs to use
storage ports 1-4 and the other half to use storage ports 5-8. With the SAN zoning, storage
port 1 would not be able to communicate with host ports 5-8.

Assuming the IOs are evenly balanced across the LUNs, the IOs will be balanced across the
paths. This results in a total of 16 paths per LUN.

Alternatively we could create 4 groups of 2 ports on the host and similarly on the storage as
shown in Figure 6 with each group represented by a different color.
Figure 6

SAN zoning and assignment of the LUNs (using either LUN masking with 2 storage ports
used per LUN, or using 4 host port groups) would be similar, and yields 4 paths per LUN.

Example 2: 8 host ports and 8 storage ports with a dual SAN fabric

Figure 7

In this example, we'd have half the storage ports and half the host ports on each fabric:
fabric 1 (represented by a single FC switch in Figure 7) containing host ports 1-4 and
storage ports 1-4 and fabric 2 with the other ports. Thus, we'd have 16 potential paths for
each LUN on each fabric (4 host ports x 4 storage ports) with a total of 32 potential paths
per LUN. So we can create two groups or subsets of 4 ports on the host and two groups or
subsets of 4 ports on the storage as represented by the red and blue links connected to the
ports in Figure 7 to get down to 8 paths per LUN.

Again we'd use single initiator zoning; taking host port 1, it would be zoned to storage ports
1 and 2. If the disk subsystem offers LUN masking, we can use a single host port group for
the host: half the LUNs would be assigned to use storage ports 1, 2, 5 and 6, and the other
LUNs would use storage ports 3, 4, 7 and 8. If the storage doesn't offer LUN masking, then
we'd use two host port groups, with host ports 1, 2, 5 and 6 in the first group. Then the
storage administrator would assign half the LUNs to each host port group, and the zoning
would ensure that only the appropriate storage ports would be used.

To reduce the number of paths to 4, and assuming the disk subsystem supports LUN
masking, the storage administrator could evenly split the LUNs across pairs of storage ports
rather than groups of 4 as suggested previously; e.g., LUN 1 would use storage ports 1 and
5, LUN 2 ports 2 and 6, etc. Without LUN masking, we could use 4 groups or subsets of the
links rather than the 2 shown in Figure 7, and this would reduce the number of paths per
LUN to 2, which is sufficient for availability but may not provide the belt and suspenders
availability that more paths offer.

Example 3: 4 host ports and 8 storage ports with a dual SAN fabric

Figure 8

This example shows a situation in which the host ports have twice the bandwidth of the
storage ports; thus, we're using twice as many ports on the storage as on the host.
Potentially we have 16 paths to each LUN (8 per fabric). We create two groups or subsets,
with each group containing two host ports (one for each fabric) and 4 storage ports (two for
each fabric), as represented by the different colors in the figure. SAN zoning again is single
initiator, with host port 1 zoned to storage ports 1 and 2, and similarly for the others. This
zoning alone reduces the number of paths to 8. The storage administrator can then, thru
LUN masking or via two host port groups, further reduce the number of paths to 4.

Spreading the IOs evenly across the ports and paths

If all LUNs use all ports, and we use some method to load balance the IOs, via the multi-
path IO driver, then IOs will be balanced across ports. But here we expect we need to
reduce the number of paths, so not all server ports will see all storage ports. And not all
LUNs will necessarily use all ports for IO. Thus we'll need some way to reasonably balance
IOs across the ports and paths. As stated earlier:

The basic approach taken here is to group the sets of host and storage ports into a number
of subsets, and then, using SAN zoning and/or LUN masking, connect each subset of host
ports to a subset of storage ports to keep the number of paths to a reasonable level. The
full set of LUNs for a server is then split into the same number of subsets, and the LUNs in
each subset are assigned to use one subset of host and storage ports.

So if IOs are balanced evenly across LUNs, then evenly balancing LUNs across the subsets
of host and storage ports will achieve balance across paths and ports. For many disk
subsystems and applications, the best practice data layout (including the LVM setup)
achieves balance across LUNs. So in that case we just evenly split the LUNs across the
subsets of host/storage ports. For applications whose layout doesn't evenly balance IOs
across LUNs, then one approach is to split LUNs from RAID arrays across the paths (which
assumes the data layout achieves balanced IOs across the RAID arrays). Finally, for disk
subsystems such as the XIV, or the SVC with striped VDisks (where IOs are balanced across
back end disks but not across LUNs), then the only way to balance the IOs is to look at the
IO rates for each LUN and then split them into separate groups such that the total IO rate
for each group is approximately the same. One such grouping is shown in Figure 9:
Figure 9

In this case, each LUN has 2 paths across separate fabrics. Alternatively, if all LUNs use all
ports, but zoning is set up so that the host and storage ports of like color only do IO with
each other, then each LUN will have 8 paths. This is an approach that can be used for
many configurations.
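
One way to obtain the per-LUN IO rates for this kind of grouping is to collect iostat data
over a representative interval and rank the hdisks by activity; the interval below is only an
example:

# iostat -d 300 1        # one 300 second sample: per-hdisk tps, Kbps, KB read and written

The busiest hdisks can then be split across the port subsets so that each subset carries
roughly the same total IO rate.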
