Recover A Failed RAID Without Deleting Data On APG40


ID: SCS128928
Domain: primus_owner@PRIMPRD
Usage Count: 3365
Class: External1
Conflicts: 0
Date Created: 9/18/2002
Date Modified: 12/22/2011
Modified By: epamks (Mark Scrivener)
Owner: epamks (Mark Scrivener)
Status: REL (released)
Suspected_Faulty: No
Type: How to
Audience: Internal
Initiated by: epamks (Mark Scrivener)
Goal
Recover a failed RAID without deleting data on APG40
Re-create a dead RAID without deleting data on APG40
Re-create a dead array without deleting data on APG40
Fact
APG40
APG40C/2
Network: CDMA
Network: GSM
Network: WCDMA
Network: Wireline
Node: AXE BSC
Node: AXE FNR
Node: AXE HLR
Node: AXE MSC
Service: Engine Integral
Symptom
Both nodes down
AP FAULT
PROBLEM: DOMAIN CONNECTION
PROBLEM: GENERAL ERROR
AP REBOOT, CAUSE by Command initiated
AP PROCESS STOPPED, CAUSE by Process death
Alarm: AP FAULT, MIRRORED DISKS NOT REDUNDANT.
Both disks of a RAID have failed
RAID marked as dead in DPT Storage Manager
STS stopped due to dead RAID disk
FOS failed
Command: raidutil displays an extra RAID entry
One node is Passive and one node is Undefined
fcc_integrate was not executed correctly
RTR is failed
Event ID: 1034
The disk associated with cluster disk resource 'Disks J: K: L: M:' could not be found.
1 of 11
30-3-12 5:16 p.m.
The disk associated with cluster disk resource 'Disks ...' could not be found. The expected signature of the disk was xxxxxxxx. If the disk was removed from the cluster, the resource should be deleted. If the disk was replaced, the resource must be deleted and created again in order to bring the disk online. If the disk has not been removed or replaced, it may be inaccessible at this time because it is reserved by another cluster node.
Both nodes in state undefined
Command: net start clussvc fails with A system error has occurred., Size of job is %1 bytes.
A system error has occurred. Size of job is %1 bytes.
Command: net start clussvc fails with A system error has occurred., System error 2 has occurred., The system cannot find the file specified.
System error 2 has occurred.
No STS & no MML & One Node is undefined
The system cannot find the file specified.
Disk Resource is Failed
Cluster disk resource failed
fcc_save_to_remove other gives "removing mirroring: failed"
'fcc_save_to_remove other' command hangs
System error 1067 has occurred.
AP NOT AVAILABLE
Alarm: STATISTICS AND TRAFFIC MEASUREMENT FILE ACCESS FAULT, STS COULD NOT ACCESS FILE
OSS heartbeat failure alarm
Cause
The RAID becomes failed (dead) when both disk drives belonging to the RAID have failed.

The RAID information is corrupt and/or a RAID controller is faulty. One known cause is loading or updating the RAID firmware on an incompatible board, for example loading the FT06 RAID firmware (CN-I APZ 212 20/5-584 and -585) on version 3.1.3.3 of the PSU-HDD board.

An incorrectly terminated SCSI bus, e.g. not doing "fcc_save_to_remove other".

A task force was created in PDU to address the large number of emergencies caused by RAID failures. The first outcome of the task force is improved handling at the repair centre, e.g. if a node is returned due to a RAID failure the RAIDs are now tested. The second outcome of the task force was a modification of the SCSI BUS RESELECTION time-out parameters on the SCSI disks. PDU believes that this will reduce the number of emergencies caused by RAID failures by at least 30% to 50%.

The APG40 GCC (GSDC Spain) and PDU have set up a monthly "KCS Triggered Product Improvement" report to determine the most common problems in APG4x and make recommendations on how to fix them. The first SOLUTION fix in this Primus will be continuously updated to include any relevant information from this report.

Ericsson internal only
Fix
REMEDY:

CONDITIONS:

1. This solution is applicable to APG40C/1 and APG40C/2.

2. The status of a RAID is Failed, Impacted or Dead. If none of the RAIDs have the status Failed, Impacted or Dead then this solution is normally not applicable. See the note "Is this solution right for me?" below for more information.

AP Command:
raidutil -L logical

Example:
C:\> raidutil -L logical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Failed

3. The RAID NVRAM is not "MOT V1.1". RAID NVRAM version "MOT V1.1" has introduced problems which may cause this Primus solution to fail. Primus solution SCS684731 should be used to upgrade or downgrade the RAID firmware on both nodes.

AP Command:
raidutil -L version

Example printout:
# Controller Cache FW NVRAM BIOS SMOR Serial
---------------------------------------------------------------------------
d0 DPT PM3757U2 0MB FT0A MOT V1.1 10-10035

PROCEDURE:

When a RAID is failed and/or both disks of the RAID are failed, the OPI "AP, System Data Disk Restore" should normally be followed to fix the problem. The OPI fixes the problem by zapping the drives, destroying all data on the data disks. This Primus solution fixes the problem by deleting and re-creating the RAID definitions without data loss. It is meant as an alternative to the OPI and should therefore be used in similar circumstances. If this Primus solution does not fix the problem then the OPI "AP, System Data Disk Restore" should be considered.

The procedure takes about 30 minutes. During this time there will be no MML contact, charging will be buffered and STS data will be lost.

1. Collect information for further analysis.
Log the information below from both nodes and send the result to the owner of this solution.

AP Command:
hostname
prcstate
date/t
time/t
raidutil -L all
frlbbdiag -v
raidutil -K
raidutil -e soft d0
raidutil -e recov d0
raidutil -e nonrecov d0
raidutil -e status d0
aehevls -l app -c dptelog
mktr <YYMMDD>-<HHMM> -c

2. Determine the source disk for the RAID re-create.
When the RAID is deleted and re-created a disk must be chosen as the source of the data for the RAID. In this solution the node that will be used as the source of the data is referred to as the good node and the other node is referred to as the faulty node. This is the most important step of the procedure and it is recommended that second line support performs it. The "raidutil -e status d0" logs from both nodes should be used to determine the sequence of events. The node where the disks failed last should normally be used as the source node. The frlbbdiag command must also be used to verify that the source node is free from faults.

Command:
frlbbdiag -v
raidutil -e status d0

3. Connect to the faulty node.
This is the node that will not be used as the source of the data for the RAID.

AP Command:
hostname

4. Shut down the node.

AP Command:
prcboot -s

5. Connect to the good node.
Use the node IP address and not the cluster IP address. This is the node that will be used as the source of the data for the RAID.

AP Command:
hostname

6. Disable the "Cluster Server" and Ericsson services startup.
Do not disable the "Cluster Disk" device as this will prevent the RAID from being deleted.

Windows 2003 Command:
sc config Clussvc start= Disabled
sc config ACS_PRC_ClusterControl start= Disabled
sc config ACS_FCH_Server start= Disabled
sc config ACS_FCR_Server start= Disabled

Windows NT Command:
echo REGEDIT4 > C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_PRC_ClusterControl] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
echo. >> C:\TEMP\Cluster_Disabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_FCH_Server] >> C:\TEMP\Cluster_Disabled.reg
echo "Start"=dword:00000004 >> C:\TEMP\Cluster_Disabled.reg
type C:\TEMP\Cluster_Disabled.reg
regedit /s C:\TEMP\Cluster_Disabled.reg
del C:\TEMP\Cluster_Disabled.reg

7. Set BIOS "Cluster Support" to Disabled (Off).

AP Command:
raidutil +cluster off

8. Reboot the node.
Do not use prcboot. The normal "prcboot" command normalises the "Cluster Server" and Ericsson services startup. After the shutdown command is entered there may be no response from the terminal until the AP finishes rebooting. This will take about 6 minutes.

Windows 2003 Command:
shutdown /f /r /t 0

Windows NT Command:
shutdown /f /r %COMPUTERNAME%

9. Check that SCSI disks are correct and available.
If the 6 SCSI disks, 3 per node, cannot be seen or the targets are incorrect then it will not be possible to re-create the RAID.

AP Command:
raidutil -L physical

Example:
C:\> raidutil -L physical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Optimal
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Failed drive
d0b1t0d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Optimal
d0b1t1d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Optimal
d0b1t2d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Failed drive

10. Check the size of the RAID.
Make a note of the size of the RAID that will be deleted and re-created. If the capacities of the disks differ then the size of the RAID has to be set when it is re-created.

AP Command:
raidutil -L raid

Example where the RAID size is 17432:
C:\> raidutil -L raid
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17432MB   Optimal
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Optimal
d0b1t0d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Optimal
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17432MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Optimal
d0b1t1d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Optimal
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17432MB   Failed
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Failed drive
d0b1t2d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Failed drive
11. Delete the RAID.
Only delete the RAIDs that are Failed, Impacted or Dead. If it is not possible to delete the RAID then follow the note "Additional steps to delete the RAID" below and then continue with the next step.

AP Command:
raidutil -D d0b0t<#>d0

Examples:
Delete RAID d0b0t0d0:
C:\> raidutil -D d0b0t0d0
d0b0t0d0
Delete RAID d0b0t1d0:
C:\> raidutil -D d0b0t1d0
d0b0t1d0
Delete RAID d0b0t2d0:
C:\> raidutil -D d0b0t2d0
d0b0t2d0

12. Check that the RAID has been deleted.
If the RAID has not been deleted then follow the note "Additional steps to delete the RAID" below and then continue with the next step.

AP Command:
raidutil -L logical

Expected Printout:
Failure:Can't find component by address

13. Set the disk cache to write back.

AP Command:
raidutil -w on d0b0t<#>d0
raidutil -w on d0b1t<#>d0

Examples:
RAID d0b0t0d0 deleted:
C:\> raidutil -w on d0b0t0d0
C:\> raidutil -w on d0b1t0d0
RAID d0b0t1d0 deleted:
C:\> raidutil -w on d0b0t1d0
C:\> raidutil -w on d0b1t1d0
RAID d0b0t2d0 deleted:
C:\> raidutil -w on d0b0t2d0
C:\> raidutil -w on d0b1t2d0

14. Re-create the RAID.
The first disk specified after the "-g" parameter is used as the source of the data when re-creating the RAID. The "-s" parameter is only required if the size of the RAID has to be set as described above. If the "-s" parameter is not specified then the size of the RAID is set to the capacity of the first disk specified after the "-g" parameter.

Note: If it is not possible to re-create the RAID then follow the note "Disconnect SCSI cables" and then continue this procedure from the next step (that is, from step 15, without re-creating the RAIDs). It is important to disconnect the SCSI cables, otherwise a disk on the shut-down node may still be accessed. This will leave the RAID deleted and allow the AP to run as a single node. The faulty node should be left shut down as it will be unable to be active. The RAID must be re-created when the faulty node is replaced, using the note "RAID re-create during node change" below.

AP Command:
raidutil -l 1 -g d0b0t<#>d0,d0b1t<#>d0 [-i -s <size>]

Examples:
Re-create RAID d0b0t0d0:
C:\> raidutil -l 1 -g d0b0t0d0,d0b1t0d0
Created: RAID 1
Re-create RAID d0b0t1d0:
C:\> raidutil -l 1 -g d0b0t1d0,d0b1t1d0
Created: RAID 1
Re-create RAID d0b0t2d0:
C:\> raidutil -l 1 -g d0b0t2d0,d0b1t2d0
Created: RAID 1
Re-create RAID d0b0t0d0 with size 17432MB:
C:\> raidutil -l 1 -g d0b0t0d0,d0b1t0d0 -i -s 17432
Created: RAID 1
Re-create RAID d0b0t1d0 with size 17432MB:
C:\> raidutil -l 1 -g d0b0t1d0,d0b1t1d0 -i -s 17432
Created: RAID 1
Re-create RAID d0b0t2d0 with size 17432MB:
C:\> raidutil -l 1 -g d0b0t2d0,d0b1t2d0 -i -s 17432
Created: RAID 1

15. Stop the RAID rebuild.
This is a precaution in case the wrong node has been chosen as the source.

AP Command:
raidutil -a stop d0

16. Set the RAID cache to write through.

AP Command:
raidutil -w off d0b0t0d0
raidutil -w off d0b0t1d0
raidutil -w off d0b0t2d0

17. Check that the RAID has been re-created.
If the RAID has not been re-created then contact the next level of support.

AP Command:
raidutil -L logical

Example:
C:\> raidutil -L logical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Degraded
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Degraded
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Degraded

18. Set BIOS "Cluster Support" to Enabled (On).

AP Command:
raidutil +cluster on

19. Normalise the "Cluster Server" and Ericsson services startup.
Note: In APZ 11.3 and later the ACS_PRC_ClusterControl service startup type should be set to automatic. This will be done in a later step.

Windows 2003 Command:
sc config ClusSvc start= Auto
sc config ACS_FCH_Server start= Auto
sc config ACS_FCR_Server start= Auto

Windows NT Command:
echo REGEDIT4 > C:\TEMP\Cluster_Enabled.reg
echo. >> C:\TEMP\Cluster_Enabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc] >> C:\TEMP\Cluster_Enabled.reg
echo "Start"=dword:00000002 >> C:\TEMP\Cluster_Enabled.reg
echo. >> C:\TEMP\Cluster_Enabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_PRC_ClusterControl] >> C:\TEMP\Cluster_Enabled.reg
echo "Start"=dword:00000003 >> C:\TEMP\Cluster_Enabled.reg
echo. >> C:\TEMP\Cluster_Enabled.reg
echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ACS_FCH_Server] >> C:\TEMP\Cluster_Enabled.reg
echo "Start"=dword:00000002 >> C:\TEMP\Cluster_Enabled.reg
type C:\TEMP\Cluster_Enabled.reg
regedit /s C:\TEMP\Cluster_Enabled.reg
del C:\TEMP\Cluster_Enabled.reg
20. Reboot the node.
The prcboot command is not used with Windows Server 2003 due to problems with the node not rebooting.

Windows 2003 Command:
shutdown /f /r /t 0

If the printout below is received then repeat the command until successful. The command will not be successful until the "Preparing network connections" dialog disappears.

The computer is processing another action and thus cannot be shut down. Wait until the computer has finished its action, and then try again.(21)

Windows NT Command:
prcboot

21. Check the status of the RAIDs.
If the RAID status has returned to Failed then replace the faulty node and repeat the procedure. If a spare node is not immediately available then follow the note "Disconnect SCSI cables" below and repeat this procedure. This will leave the RAID deleted and allow the AP to run as a single node. The faulty node should be left shut down until a replacement is available. If this is done the RAID must be re-created when the faulty node is replaced, using the note "RAID re-create during node change" below.

AP Command:
raidutil -L logical

22. Wait for all resources to come online.
The resources owned by the faulty, shut-down node will not come online. If the faulty node is going to be replaced then the procedure is complete.

23. Reboot the faulty, shut-down node.
This step should not be performed if the faulty node should be left shut down or if the RAID was not re-created.

AP Command:
fcc_reset other

24. Wait for all resources to come online.

25. Normalise the ACS_PRC_ClusterControl resource.
This step is not required with Windows NT as the prcboot command above sets the startup type.

Windows Server 2003 Command:
sc config ACS_PRC_ClusterControl start= Auto

26. Make sure the RAID rebuild is set to fast.

AP Command:
raidutil -r fast d0

Example:
C:\> raidutil -r fast d0
Address    Type               Rate
---------------------------------------------------------------------------
d0b0t7d0   HBA                9.0s (fast)
d0b0t2d0   RAID 1 (Mirrored)  9.0s (fast)
d0b0t0d0   RAID 1 (Mirrored)  9.0s (fast)
d0b0t1d0   RAID 1 (Mirrored)  9.0s (fast)

27. Check the RAID disks for faults.
If there are any faults then follow the OPI "AP FAULT" and do not attempt a rebuild. Do not perform the remaining steps in this procedure.

Command:
frlbbdiag

28. Rebuild the re-created RAIDs.

AP Command:
raidutil -a rebuild d0b0t<#>d0

Examples:
Rebuild RAID d0b0t0d0:
C:\> raidutil -a rebuild d0b0t0d0
d0b0t0d0
Rebuild RAID d0b0t1d0:
C:\> raidutil -a rebuild d0b0t1d0
d0b0t1d0
Rebuild RAID d0b0t2d0:
C:\> raidutil -a rebuild d0b0t2d0
d0b0t2d0

29. Perform a health check of the AP.
Follow Primus solution SCS123402.

30. Query and change the SCSI BUS RESELECTION settings with FrChangeDisk.
Follow Primus solution SCS841510.

31. Implement APG40C/2 RAID improvements as per the SOLUTION fix below.

SOLUTION:

CONDITIONS:

1. As in the REMEDY above.

PROCEDURE:

Implement recommendations from PDU task force and GCC/PDU APG40 KCS Triggered Product Improvement.

1. Implement the SCSI BUS RESELECTION time-out parameter change.
This change is introduced with CN-I APZ 212 30/4-1126. This CN-I is included in the following packages:
- BSC PLM: APG40 One Track IP-A203.
- MSC PLM: APG40 One Track EP-A111.
- APZ PLM: APG40 One Track AGM018.
The FrChangeDisk tool introduced with CN-I APZ 212 30/4-1126 has been updated to fix faults in the following CN-Is.
CN-I APZ 212 30/4-1233. This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM019.
CN-I APZ 212 30/4-1487. This CN-I is included in the following packages:
- APZ PLM: APG40 One Track UAM009.

2. Implement the FrLbbDiag tool and ContLogCollector service.
This change is introduced with CN-I APZ 212 30/4-1140. This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM018.
The FrLbbDiag tool introduced with CN-I APZ 212 30/4-1140 has been updated in the following CN-Is.
CN-I APZ 212 30/4-1375. This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM020.

3. Implement NVRAM Force V2.1.
This change is introduced with CN-I APZ 212 30/4-1217. This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM019.

4. Use the updated AP FAULT OPI when rebuilding the RAID when the AP FAULT alarm is raised.
This change is introduced with CN-I APZ 212 30/4-1373. This CN-I is included in the following packages:
- APZ PLM: APG40 One Track AGM020.
Ericsson internal only

SOLUTION:

CONDITIONS:

1. As in the REMEDY above.

2. Following the procedure in the above REMEDY either did not fix the problem, or there was a subsequent occurrence of the same fault.

PROCEDURE:

1. Using the information from the log files gathered in the REMEDY above, determine which node is most likely to be faulty.

2. Change the node.
See the Operational Instruction "APG40, Node, Change". If unsure about which node should be changed, please contact the next level of support for assistance. It is possible that the actual fault is not in the indicated node, but in the other node, or in one of the SCSI cables connecting the two nodes. These may also need to be changed if changing the indicated node still does not fix the problem.

SOLUTION:

CONDITIONS:

1. As in the first REMEDY above.

PROCEDURE:

1. It is normal for a RAID to be failed when both hard disks have failed.
It is the opinion of design that the RAID should be failed when both disk drives belonging to the RAID have failed. Preventing this issue from occurring requires choosing a disk drive to be used as the source of the data for the RAID. It is the opinion of design that it is too dangerous to allow the system to do this and that it is better to follow the OPI "AP, System Data Disk Restore" and erase the data disks. It is therefore important to replace any node with a failed disk drive as soon as possible. Several other TRs have been raised on this issue; TR HE82881 is a good example of design's opinion of the problem. TR HG51610 has been raised to address this issue.

Ericsson internal only
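Steps 6 and 19 of the REMEDY write a small REGEDIT4 file on Windows NT that sets the "Start" value of each service key: 4 disables the service, 2 is automatic start and 3 is manual start. The Python sketch below only builds the text of such a file, so that the "disabled" and "enabled" variants can be compared side by side. The function name build_reg_text and the way the values are passed in are illustrative assumptions, not part of any Ericsson tooling, and nothing here touches the registry.

```python
# Illustrative sketch only: builds the text of the REGEDIT4 files echoed
# together in steps 6 and 19. It does not read or write the registry.

SERVICES_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"

# Windows service start types as used in the procedure.
AUTO, MANUAL, DISABLED = 2, 3, 4


def build_reg_text(start_values):
    """Return REGEDIT4 text that sets "Start" for each service name.

    start_values maps a service name to its start type; the helper name
    and signature are hypothetical.
    """
    lines = ["REGEDIT4"]
    for service, start in start_values.items():
        lines.append("")  # blank line, as produced by "echo."
        lines.append("[%s\\%s]" % (SERVICES_KEY, service))
        lines.append('"Start"=dword:%08x' % start)
    return "\n".join(lines) + "\n"


# Step 6: disable the "Cluster Server" and Ericsson services startup.
disabled_reg = build_reg_text({
    "ClusSvc": DISABLED,
    "ACS_PRC_ClusterControl": DISABLED,
    "ACS_FCH_Server": DISABLED,
})

# Step 19: normalise the startup types again.
enabled_reg = build_reg_text({
    "ClusSvc": AUTO,
    "ACS_PRC_ClusterControl": MANUAL,
    "ACS_FCH_Server": AUTO,
})

print(disabled_reg)
print(enabled_reg)
```

Comparing the two printed files against the echo commands in steps 6 and 19 is a quick way to confirm that no key was missed before running regedit /s.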
Note
Is this solution right for me?

If the "raidutil -L raid" printout displays 3 "RAID 1" entries each with 2 "Disk Drive" entries, with correct targets, and the status of the "RAID 1" entries is Optimal or Degraded then this solution is NOT applicable.

Example 1: The printout shows a degraded RAID and a failed disk drive. This solution is NOT applicable. The OPI "AP FAULT" should be used instead.

C:\> raidutil -L raid
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b1t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b1t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Degraded
d0b1t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Failed
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal

Example 2: The printout shows a degraded RAID and a missing disk drive. This solution is NOT applicable. Primus solution SCS388828 should be used instead. It may be necessary to reboot the node after power cycling the faulty node for the SCSI disks to be scanned.

C:\> raidutil -L raid
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Degraded
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   0MB       Missing

Example 3: The printout shows a degraded RAID and a missing disk drive with an invalid target. This solution may be applicable. If "raidutil -L physical" correctly shows all 6 disks then Primus solution SCS388828 should be used. If the problem persists the faulty node should be shut down and the source node rebooted. If the problem still persists this solution should be followed to delete and re-create the corrupted RAID information.

C:\> raidutil -L raid
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Degraded
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t3d0   Disk Drive (DASD)  FUJITSU MAN3184MP   0MB       Missing

Example 4: The printout shows a failed RAID. This solution is applicable. The failed RAIDs need to be deleted and re-created.

C:\> raidutil -L raid
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t0d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t1d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Failed
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal
d0b1t2d0   Disk Drive (DASD)  FUJITSU MAN3184MP   17522MB   Optimal

Special note when the OPI "APG40, Node, Change" was followed without zapping the RAIDs on the replaced node.

Example 5: The printout shows a failed RAID. This solution is applicable. The failed RAIDs need to be deleted and re-created. In this case the hard disks on the non-replaced node should be used as the source of the data. Therefore the procedure should be performed on that node.

C:\> raidutil -L raid
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Failed
d0b1t4d0   Disk Drive (DASD)  DPT --UNKNOWN-      0MB       Missing
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAP3367NP   17522MB   Failed drive
d0b0t1d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Failed
d0b1t5d0   Disk Drive (DASD)  DPT --UNKNOWN-      0MB       Missing
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAP3367NP   17522MB   Failed drive
d0b0t2d0   RAID 1 (Mirrored)  DPT RAID-1          17522MB   Failed
d0b1t3d0   Disk Drive (DASD)  DPT --UNKNOWN-      0MB       Missing
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAP3367NP   17522MB   Failed drive

RAID re-create during node change

1. Follow the OPI "APG40, Node, Change" until the SCSI cables are reconnected and the node is powered on.

OPI "APG40, Node, Change, APG40C/2":
7/154 31-CRZ 222 02 revision X: Stop after step 136.
7/154 31-CRZ 222 02 revision Z: Stop after step 146.
7/154 31-CRZ 222 04 revision A: Stop after step 136.
7/154 31-CRZ 222 04 revision B: Stop after step 145.
7/154 31-CRZ 222 05 revision E: Stop after step 132.
7/154 31-CRZ 222 05 revision K: Stop after step 132.
7/154 31-CRZ 222 05 revision M: Stop after step 164.
7/154 31-CRZ 222 05 revision S: Stop after step 153.
7/154 31-CRZ 222 05 revision T: Stop after step 143.
7/154 31-CRZ 222 05 revision U: Stop after step 178.

OPI "APG40, Node, Change, C/2, Win 2003 Spare":
12/154 31-CRZ 222 05 revision C: Stop after step 141.

2. Repeat the procedure above to re-create the deleted RAID.
Note: As the RAID has already been deleted the step "Delete the RAID" in the procedure should be skipped.

3. Continue with the OPI "APG40, Node, Change" from the next step.
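When the procedure is repeated to re-create the RAID, steps 10 and 14 of the REMEDY come down to one decision: if the two member disks report different capacities, the RAID size noted in step 10 must be given with "-i -s <size>"; otherwise the parameter can be left out. A minimal Python sketch of that decision follows. The function name recreate_command and its arguments are hypothetical, the disk order after "-g" is fixed to the order used in the examples of step 14, and the sketch only builds the command string; it does not run raidutil.

```python
# Illustrative sketch only: builds the step 14 command line as a string.
# Nothing is executed; recreate_command is a hypothetical helper name.

def recreate_command(target, cap_bus0_mb, cap_bus1_mb, noted_raid_size_mb):
    """Return the raidutil re-create command for RAID target 0, 1 or 2.

    cap_bus0_mb / cap_bus1_mb are the capacities of the two member disks
    as listed by "raidutil -L raid"; noted_raid_size_mb is the RAID size
    written down in step 10 before the RAID was deleted.
    """
    if target not in (0, 1, 2):
        raise ValueError("target must be 0, 1 or 2")
    # The first disk after -g is used as the source of the data.
    cmd = "raidutil -l 1 -g d0b0t%dd0,d0b1t%dd0" % (target, target)
    if cap_bus0_mb != cap_bus1_mb:
        # Capacities differ, so the size has to be set explicitly.
        cmd += " -i -s %d" % noted_raid_size_mb
    return cmd


# Disks of 17522MB and 17432MB, RAID size noted as 17432MB (step 10 example).
print(recreate_command(2, 17522, 17432, 17432))
# Two disks of equal capacity: no size parameter is needed.
print(recreate_command(0, 17522, 17522, 17522))
```

The two printed lines correspond to the "with size 17432MB" and plain variants shown in the examples of step 14.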
Additional steps to delete the RAID

This note contains additional steps for the step "Delete the RAID" in the procedure above.

1. Disconnect the SCSI cables.
Remove the upper (top) SCSI cable from the good node.
Remove the lower (bottom) SCSI cable from the good node.
Remove the upper (top) SCSI cable from the faulty node.
Remove the lower (bottom) SCSI cable from the faulty node.

2. Delete the RAID.
If it is not possible to delete the RAID then contact the next level of support.

AP Command:
raidutil -D d0b0t<#>d0

Examples:
Delete RAID d0b0t0d0:
C:\> raidutil -D d0b0t0d0
d0b0t0d0
Delete RAID d0b0t1d0:
C:\> raidutil -D d0b0t1d0
d0b0t1d0
Delete RAID d0b0t2d0:
C:\> raidutil -D d0b0t2d0
d0b0t2d0

3. Check that the RAID has been deleted.
If the RAID has not been deleted then contact the next level of support.

AP Command:
raidutil -L logical

4. Reconnect the SCSI cables.
Connect the upper (top) SCSI cable to the faulty node.
Connect the lower (bottom) SCSI cable to the faulty node.
Connect the upper (top) SCSI cable to the good node.
Connect the lower (bottom) SCSI cable to the good node.

5. Check that the six SCSI disks are correct and available.
If the 3 SCSI disks on bus 1 are not visible then follow the step "Reboot the node" in the procedure.

AP Command:
raidutil -L physical

Example:
C:\> raidutil -L physical
Address    Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Optimal
d0b0t1d0   Disk Drive (DASD)  FUJITSU MAH3182MP   17432MB   Optimal
d0b0t2d0   Disk Drive (DASD)  FUJITSU MAT3073NP   17522MB   Optimal

6. Continue with the procedure above.
Continue with the step "Re-create the RAID" in the procedure.

Disconnect SCSI cables

1. Disconnect the SCSI cables.
Remove the upper (top) SCSI cable from the good node.
Remove the lower (bottom) SCSI cable from the good node.
Remove the upper (top) SCSI cable from the faulty node.
Remove the lower (bottom) SCSI cable from the faulty node.
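The status check behind condition 2 of the REMEDY and the note "Is this solution right for me?" can be sketched in Python as below. The sketch scans a saved "raidutil -L raid" printout and lists the "RAID 1" entries whose status is Failed, Impacted or Dead. The function name failed_raids and the regular expression are assumptions based only on the example printouts in this document; the sketch does not check for invalid targets (Example 3) and is no substitute for the judgement of second line support.

```python
import re

# Statuses for which the RAID is a candidate for delete and re-create.
RECREATE_STATUSES = {"Failed", "Impacted", "Dead"}

# address, description, capacity in MB, status; the layout is assumed
# from the example printouts in this document.
LINE = re.compile(r"^(d\d+b\d+t\d+d\d+)\s+(.*?)\s+(\d+)MB\s+(.+?)\s*$")


def failed_raids(printout):
    """Return addresses of RAID 1 entries with a re-create status."""
    result = []
    for line in printout.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # header, separator or unparsable line
        address, description, _capacity, status = match.groups()
        if description.startswith("RAID 1") and status in RECREATE_STATUSES:
            result.append(address)
    return result


# Abbreviated printout in the style of Example 4.
sample = """\
Address Type Manufacturer/Model Capacity Status
d0b0t0d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Optimal
d0b0t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t0d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b0t2d0 RAID 1 (Mirrored) DPT RAID-1 17522MB Failed
d0b0t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
d0b1t2d0 Disk Drive (DASD) FUJITSU MAN3184MP 17522MB Optimal
"""

print(failed_raids(sample))
```

An empty list means that no RAID has the status Failed, Impacted or Dead, in which case this solution is normally not applicable and the examples in the note "Is this solution right for me?" should be consulted.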