
Problem

When the device paths are disconnected and reconnected, the operating system may assign
different OS device numbers to the device paths. Currently, the Data Corruption Prevention
Activation (DCPA) feature of DMP does not handle this situation correctly if only some of the
paths are assigned different device numbers.
If only a subset of the paths are assigned different OS device numbers, the DMP database of
vxconfigd can become corrupted, resulting in the VxVM configuration daemon, vxconfigd,
terminating abnormally and generating a core file. If vxconfigd subsequently uploads this
incorrect DMP database to the kernel, the in-kernel DMP database also becomes corrupted. This
can result in data corruption, because application I/O can be directed to the wrong devices.

Error Message
The following error messages indicate that the device paths were assigned different OS device
numbers on the Linux platform.
Aug 17 08:43:31 server101 kernel: sd 4:0:0:3: Warning! Received an indication that the LUN
assignments on this target have changed. The Linux SCSI layer does not automatically remap
LUN assignments.
Aug 17 08:43:31 server101 kernel: sd 4:0:0:19: Warning! Received an indication that the LUN
assignments on this target have changed. The Linux SCSI layer does not automatically remap
LUN assignments.

On Solaris the following message is logged.


Oct 27 12:15:01 hosty scsi: [ID 243001 kern.info]/pci@1e,600000/pci@0/pci@9/pci@0/pci@8
/SUNW,qlc@1/fp@0,0 (fcp1):
Oct 27 12:15:01 hosty FCP: Report Lun Has Changed target=b0500
One or more of the following symptoms may occur.
1. The DMP configuration is incorrect, as shown by the "vxdisk list" or "vxdmpadm getsubpaths"
commands: paths to different LUNs are claimed under a single DMP device.
2. The vxconfigd process dumps core.
3. File system corruption messages are logged in the system log file.
4. The file system is disabled because serious corruption is detected.

Cause
When device paths are disconnected and reconnected, the operating system releases the OS
device numbers for future reuse after a certain period of time. On Linux, for example, the length
of this period is controlled by the configurable kernel parameter dev_loss_tmo. If the device
paths are reconnected after this period has elapsed, the operating system will probably assign
different device numbers to the reconnected device paths. Currently, if only some of the paths
(not all of them) are assigned different device numbers, the DCPA feature of DMP cannot handle
the situation and the DMP configuration may become corrupt. If this corrupt DMP configuration
is uploaded to the DMP kernel driver, the in-kernel DMP configuration becomes corrupt as well,
which can lead to data corruption because data is written to the wrong disk.
The issue can potentially affect all platforms, but the Linux platform is more prone to it.
Please note that depending on the severity of the DMP configuration corruption and the actual
Veritas Storage Foundation configuration (e.g. a standalone installation versus Cluster Volume
Manager), the symptoms of the issue may vary. It can sometimes be difficult to determine whether
a particular problem was caused by the aforementioned issue. The root cause analysis may involve
detailed analysis of the SAN topology and the changes to the SAN environment at the time the
problem happened, the system log messages, the DMP logs (e.g. dmpevents.log and ddl.log), the
vxconfigd core file if vxconfigd dumped core, the DMP configuration (e.g. "vxdisk list" output,
"vxdmpadm getsubpaths" output and the kernel core dump), and the hardware UDID and the
on-disk UDID of the DMP devices.
The problem is tracked through the etrack incident listed in the Supplemental Material section.
The following is a description of the etrack incident.

Symptom:
When device paths are moved across LUNs or enclosures, the vxconfigd daemon can dump core,
or data corruption can occur, due to internal data structure inconsistencies.
Description:
When the device path configuration is changed after a planned or unplanned disconnection by
moving only a subset of the device paths across LUNs or other storage arrays (enclosures),
DMP's internal data structures become inconsistent, leading to the vxconfigd daemon dumping
core and, in some situations, to data corruption due to incorrect LUN-to-path mappings.
Resolution:
To resolve this issue, the vxconfigd code was modified to detect such situations gracefully and
adjust the internal data structures accordingly to avoid a vxconfigd core dump and data corruption.

Solution
The problem is fixed in the following patch releases.
Veritas Storage Foundation 5.1SP1RP4 on all platforms. (sfha-<platform>-5.1SP1RP4)
Veritas Storage Foundation 6.0.3 Hot Fix 1 on all platforms. (vm-<platform>-6.0.3.100)
The above patches can be downloaded from the Symantec Operation Readiness Tools web site.
https://sort.symantec.com/patch/matrix
Before the above patch is applied, the temporary workaround on Linux is to increase the default
dev_loss_tmo to a high value to prevent device number reuse after a fabric loss and restore.
Please note that on Linux the maximum value of dev_loss_tmo is normally 600 seconds. For the
workaround, we therefore suggest creating the file /etc/udev/rules.d/40-rport.rules with the
following single-line rule:
KERNEL=="rport-*", SUBSYSTEM=="fc_remote_ports", ACTION=="add", RUN+="/bin/sh -c 'echo 600 > /sys/class/fc_remote_ports/%k/dev_loss_tmo'"
This sets dev_loss_tmo to 600 for all Fibre Channel HBA drivers. The maximum allowed value
for dev_loss_tmo also depends on the actual Fibre Channel driver; please check with the
operating system vendor whether the FC driver supports values greater than 600.
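Because the udev rule above fires only on the "add" event, it does not change remote ports that
are already present. The following is a minimal sketch, assuming the FC driver exposes its remote
ports under /sys/class/fc_remote_ports, for applying the new value to the existing ports and then
verifying the result (each line of output should read 600):
# for r in /sys/class/fc_remote_ports/rport-*; do echo 600 > $r/dev_loss_tmo; done
# cat /sys/class/fc_remote_ports/rport-*/dev_loss_tmo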
Please note that the above workaround can only prevent the problem if the disconnected device
paths are reconnected within dev_loss_tmo. If the disconnected device paths are reconnected
after dev_loss_tmo, the OS can assign different device numbers and hence cause the problem.
Details of the above kernel parameter can be found in the following Red Hat document.
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/5/html-single/Online_Storage_Reconfiguration_Guide

Applies To
The problem affects VxVM 5.1, 5.1SP1, 6.0.1 and 6.0.3 without the required fix. The problem
mainly affects the Linux platform.

How to identify the issue
===================
Please check the system log messages to see whether any LUN paths were disconnected and
reconnected. Then check whether the following kernel messages were logged; they indicate that
the device paths were assigned different device numbers. On Linux the following message is logged.
Aug 17 09:23:42 server101 kernel: sd 4:0:0:4: Warning! Received an indication that the LUN
assignments on this target have changed. The Linux SCSI layer does not automatically remap
LUN assignments.
On Solaris the following message is logged.
Oct 27 12:15:01 hosty scsi: [ID 243001 kern.info]/pci@1e,600000/pci@0/pci@9/pci@0/pci@8
/SUNW,qlc@1/fp@0,0 (fcp1):
Oct 27 12:15:01 hosty FCP: Report Lun Has Changed target=b0500
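On Linux, a quick way to check for these messages is to search the system log directly; the
default log location /var/log/messages is assumed here:
# grep "LUN assignments on this target have changed" /var/log/messages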

Steps to determine if the DMP configuration is correct
========================================
If the problem is detected by the DMP Data Corruption Prevention Activation (DCPA)
mechanism, the following vxconfigd messages are logged and DMP has prevented the corruption
from happening.
Jun 29 05:00:03 hostx vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-13788 Data Corruption
Protection Activated - User Corrective Action Needed:
Jun 29 05:00:03 hostx vxvm:vxconfigd: [ID 702911 daemon.notice] V-5-1-13789 LUN serial
number 0013 of old node c3t50060E8005892C10d60 does not match with LUN serial number
08CB of new node c3t50060E8005892C10d60 even though both have same device number 118/488
Jun 29 05:00:05 hostx vxdmp: [ID 238993 kern.notice] NOTICE: VxVM vxdmp 0
dmp_tur_temp_pgr: open failed: error = 6 dev=0x147/0x21a
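On Linux, the presence of the DCPA messages can be checked by searching the system log for
the message IDs shown above (default log location assumed):
# egrep "V-5-1-13788|V-5-1-13789" /var/log/messages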
If the above messages are not observed, the DMP configuration needs to be checked for
consistency. The DMP configuration can be printed with the "vxdmpadm getsubpaths" command.
For example:
# vxdmpadm getsubpaths
NAME STATE[A] PATH-TYPE[M] DMPNODENAME ENCLR-NAME CTLR ATTRS
======================================================================
c0t0d0s2 ENABLED(A) - disk_0 disk c0 -
c3t50060E80004372C0d0s2 ENABLED(A) PRIMARY hds9500-alua88_84 hds9500-alua88 c3 -
c3t50060E80004372C1d0s2 ENABLED(A) PRIMARY hds9500-alua88_84 hds9500-alua88 c3 -
c3t50060E80004372C0d1s2 ENABLED(A) PRIMARY hds9500-alua88_85 hds9500-alua88 c3 -
c3t50060E80004372C1d1s2 ENABLED(A) PRIMARY hds9500-alua88_85 hds9500-alua88 c3 -
c3t50060E80004372C0d2s2 ENABLED(A) PRIMARY hds9500-alua88_86 hds9500-alua88 c3 -
c3t50060E80004372C1d2s2 ENABLED(A) PRIMARY hds9500-alua88_86 hds9500-alua88 c3 -
Please check if the paths listed in the first column (NAME) actually belong to the DMP device
listed in the fourth column (DMPNODENAME).
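To examine one DMP device at a time, the subpaths of a single DMP node can be listed with the
dmpnodename attribute; the node name below is taken from the example output above:
# vxdmpadm getsubpaths dmpnodename=hds9500-alua88_84
All of the listed paths should lead to the same LUN.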

The hardware UDID generated dynamically by the Array Support Library (ASL) and the on-disk
private region UDID can also be compared to see if vxconfigd is reading the correct private region
from the physical device. The UDID numbers can be obtained by running the "vxdisk -v list
<da>" command.
# vxdisk -v list hds9500-alua88_89
Device: hds9500-alua88_89
devicetag: hds9500-alua88_89
type: auto
hostid: alaw1
disk: name= id=1166618297.275.alaw1
group: name=coordg id=1166618355.279.alaw1
info: format=cdsdisk,privoffset=256,pubslice=2,privslice=2
flags: online ready private autoconfig noautoimport
pubpaths: block=/dev/vx/dmp/hds9500-alua88_89s2 char=/dev/vx/rdmp/hds9500-alua88_89s2
guid: {f77e7eea-1dd1-11b2-8947-0003bae2eb47}
udid: HITACHI%5FDF600F%5FD600101C%5F0059 <<<udid generated dynamically by ASL
....
Annotations:
tag udid_asl=HITACHI%5FDF600F%5FD600101C%5F0059 <<< udid stored in the private
region of the disk
Multipathing information:
numpaths: 2
c3t50060E80004372C0d5s2 state=enabled type=primary
c3t50060E80004372C1d5s2 state=enabled type=primary
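Comparing the two UDID values by hand is tedious on a system with many disks. The following
is a minimal shell sketch that flags mismatches between the ASL-generated udid and the on-disk
udid_asl tag; it assumes the "vxdisk -v list" output format shown above and uses the -q option to
suppress the "vxdisk list" header:
for da in `vxdisk -q list | awk '{print $1}'`
do
    # hardware UDID reported by the ASL ("udid:" line)
    hw=`vxdisk -v list $da | awk '/^udid:/ {print $2}'`
    # UDID stored in the on-disk private region ("tag udid_asl=" line)
    od=`vxdisk -v list $da | sed -n 's/.*udid_asl=\([^ ]*\).*/\1/p'`
    if [ -n "$od" ] && [ "$hw" != "$od" ]
    then
        echo "UDID mismatch on $da: asl=$hw on-disk=$od"
    fi
done
Any disk flagged by the sketch should be investigated before further I/O is allowed to it.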

Problem
Multiple LUN device paths are assigned to a single DMP device node after running "vxdctl
enable" under Storage Foundation 5.1 SP1.
Error Message
# vxdisk list hitachi_usp-vm1_011a
Device: hitachi_usp-vm1_011a
devicetag: hitachi_usp-vm1_011a
type: auto
flags: error private autoconfig
pubpaths: block=/dev/vx/dmp/hitachi_usp-vm1_011as2
char=/dev/vx/rdmp/hitachi_usp-vm1_011as2
guid: -
udid: HITACHI%5FOPEN-V%20%20%20%20%20%20-SUN%5F06505%5F011A
site: -
Multipathing information:
numpaths: 10
c2t50060E8005650550d84s2 state=enabled
c2t50060E8005650550d83s2 state=enabled
c2t50060E8005650550d82s2 state=enabled
c2t50060E8005650550d81s2 state=enabled
c2t50060E8005650550d80s2 state=enabled
c3t50060E8005650540d84s2 state=enabled
c3t50060E8005650540d83s2 state=enabled
c3t50060E8005650540d82s2 state=enabled
c3t50060E8005650540d81s2 state=enabled
c3t50060E8005650540d80s2 state=enabled
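In this environment each LUN is expected to have two paths (one per controller), so DMP nodes
that have claimed an unexpected number of paths can be flagged by counting subpaths per
DMPNODENAME. A rough sketch, assuming the two-line header of "vxdmpadm getsubpaths"
output and an expected path count of 2:
# vxdmpadm getsubpaths | awk 'NR > 2 {n[$4]++} END {for (d in n) if (n[d] != 2) print d, n[d], "paths"}'
A node such as hitachi_usp-vm1_011a above, which has claimed 10 paths, would be reported
immediately.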
Cause
The ddl_vendor_claim_device() function does not check the validity of the return value of the
DDL_CHECK_DEVICE_TPD() macro function, resulting in corruption of the DMP device tree.
Solution
The in-core DMP database device tree corruption caused by this issue cannot be resolved online;
a system restart is required to correct it.
In addition, impacted systems should be targeted for patching immediately.
At the time of writing, this issue is fixed in Patch 2 for Storage Foundation 5.1 SP1 RP1.
The issue will also be fixed in Rolling Patch 2 for Storage Foundation 5.1 SP1 and later patches.

Applies To
This issue affects systems running Storage Foundation 5.1 SP1, Storage Foundation 5.1 SP1 RP1,
and Storage Foundation 5.1 SP1 RP1 Hotfixes 1 and 2.
