VERITAS Cluster Server 3.5 for Solaris


Lesson 14
Troubleshooting

Overview
[Figure: course map. Topics shown: Troubleshooting (this lesson); Introduction; Terms and Concepts; Installing VCS; Managing Cluster Services; Using Cluster Manager; Service Group Basics; Preparing Resources; Resources and Agents; NFS Resources; Installing Applications; Cluster Communication; Faults and Failovers; Event Notification; Using Volume Manager.]

VCS_3.5_Solaris_R3.5_20020915

Objectives
After completing this lesson, you will be able to:
Monitor system and cluster status.
Apply troubleshooting techniques in a VCS
environment.
Detect and solve VCS communication problems.
Identify and solve VCS engine problems.
Correct service group problems.
Resolve problems with resources.
Solve problems with agents.
Correct resource type problems.
Plan for disaster recovery.

Monitoring VCS
VCS log files
System log files
The hastatus utility
SNMP traps
Event notification triggers
Cluster Manager


VCS Log Entries

Engine log: /var/VRTSvcs/log/engine_A.log


View logs using the GUI or the hamsg command:
hamsg engine_A

Example entries:
TAG_D 2001/04/03 12:17:44 VCS:11022:VCS engine (had) started
TAG_D 2001/04/03 12:17:44 VCS:10114:opening GAB library
TAG_C 2001/04/03 12:17:45 VCS:10526:IpmHandle::recv peer exited errno 10054
TAG_E 2001/04/03 12:17:52 VCS:10077:received new cluster membership
TAG_E 2001/04/03 12:17:52 VCS:10080:Membership: 0x3, Jeopardy: 0x0
(Slide callout: Most Recent)
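As a rough illustration of how engine log entries can be read programmatically, the following Python sketch splits one entry into its tag, timestamp, message ID, and text. It assumes each entry occupies a single line in the form `TAG_x date time VCS:id:text`, as in the example entries on this slide. The names `LogEntry`, `ENTRY_RE`, and `parse_entry` are illustrative and are not part of VCS.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    tag: str
    timestamp: str
    msg_id: int
    text: str


# Assumed single-line layout: TAG_x YYYY/MM/DD HH:MM:SS VCS:<id>:<text>
ENTRY_RE = re.compile(
    r"^(TAG_[A-Z])\s+(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+VCS:(\d+):(.*)$"
)


def parse_entry(line: str) -> Optional[LogEntry]:
    """Return the parsed entry, or None if the line does not match the assumed layout."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    tag, timestamp, msg_id, text = match.groups()
    return LogEntry(tag, timestamp, int(msg_id), text.strip())


entry = parse_entry(
    "TAG_C 2001/04/03 12:17:45 VCS:10526:IpmHandle::recv peer exited errno 10054"
)
```

Filtering parsed entries by `tag` or `msg_id` is one way to narrow a long log to the messages of interest before reading it in full.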

Agent Log Entries


Agent logs are kept in /var/VRTSvcs/log.
Log files are named AgentName_A.log.
LogLevel attribute settings:
none
error (default setting)
info
debug
all
To change log level:
hatype -modify res_type LogLevel debug

Troubleshooting Guide
Start by running hastatus -summary:
Cluster communication problems are indicated by
the message:
Cannot connect to server -- Retry Later
VCS engine startup problems are indicated by
systems in one of the WAIT states.
Service group, resource, or agent problems are
indicated within the hastatus display.
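The triage order on this slide can be sketched as a small Python function that classifies captured `hastatus -summary` output. This is only a sketch: the function name and the category labels are hypothetical, and the check for WAIT states looks only for the two state names discussed later in this lesson.

```python
# WAIT states named in this lesson; other WAIT states are not checked here (assumption).
WAIT_STATES = ("STALE_ADMIN_WAIT", "ADMIN_WAIT")


def triage(summary_output: str) -> str:
    """Map captured hastatus -summary text to the problem area to investigate first."""
    if "Cannot connect to server" in summary_output:
        return "cluster-communication"
    if any(state in summary_output for state in WAIT_STATES):
        return "engine-startup"
    # Anything else is read from the hastatus display itself.
    return "group-resource-agent"
```

The order of the checks mirrors the slide: communication first, then engine startup, then everything that is visible in the status display.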


Cluster Communication Problems


Run lltconfig to determine if LLT is running. If LLT is not
running:
Check the /etc/llttab file:
Verify that the node number is within range (0-31).
Verify that the cluster number is within range (0-255).
Determine whether the link directive is specified correctly (qf3
should be qfe, for example).

Check the /etc/llthosts file:


Verify that node numbers are within range.
Verify that the system names match the entries in the llttab or
sysname files.

Check the /etc/VRTSvcs/conf/sysname file:


Make sure there is only one system name in the file.
Verify that the system name matches the entry in the llthosts
file.
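The range checks above can be sketched in Python. The sketch assumes that each llthosts line has the form `node_id system_name` and that llttab sets the cluster number with a `set-cluster` directive; both layouts and the function names are assumptions made for illustration. The ranges (0-31 and 0-255) are the ones given on this slide.

```python
from typing import List


def check_llthosts(text: str) -> List[str]:
    """Report llthosts lines whose node ID is malformed or outside 0-31."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(fields) != 2 or not fields[0].isdigit():
            problems.append(f"line {lineno}: expected '<node_id> <system_name>'")
            continue
        if not 0 <= int(fields[0]) <= 31:
            problems.append(f"line {lineno}: node ID {fields[0]} outside 0-31")
    return problems


def check_cluster_id(llttab_text: str) -> List[str]:
    """Report a set-cluster value that is not a number in 0-255."""
    problems = []
    for line in llttab_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "set-cluster":
            if not fields[1].isdigit() or not 0 <= int(fields[1]) <= 255:
                problems.append(f"cluster ID {fields[1]} outside 0-255")
    return problems
```

Both functions take the file contents as a string, so they can be tried on copies of the files without touching a running cluster.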

Problems with LLT


If LLT is running:
Run lltstat -n to determine if systems can see each other on
the LLT link.
Check the physical network connection(s) if LLT cannot see
each node.
train11# lltconfig
LLT is running
train11# lltstat -n
LLT node information:
    Node           State      Links
  * 0 train11      OPEN       2
    1 train12      CONNWAIT   2

train12# lltconfig
LLT is running
train12# lltstat -n
LLT node information:
    Node           State      Links
    0 train11      CONNWAIT   2
  * 1 train12      OPEN       2
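To spot nodes that LLT cannot reach, captured `lltstat -n` output can be scanned for any node whose state is not OPEN. The Python sketch below assumes the column layout of the example output (optional `*`, node number, name, state, link count); the regular expression and the function name are illustrative only.

```python
import re
from typing import List

# Assumed row layout: [*] <node_id> <name> <state> <links>
NODE_RE = re.compile(r"^\s*(\*?)\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$")


def unreachable_nodes(lltstat_output: str) -> List[str]:
    """Return the names of nodes whose state is anything other than OPEN."""
    names = []
    for line in lltstat_output.splitlines():
        match = NODE_RE.match(line)
        if match and match.group(4) != "OPEN":
            names.append(match.group(3))
    return names


sample = """LLT node information:
    Node           State      Links
  * 0 train11      OPEN       2
    1 train12      CONNWAIT   2
"""
print(unreachable_nodes(sample))
```

A non-empty result is the cue to check the physical network connections for the listed nodes.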

Problems with GAB


Check GAB by running gabconfig -a:
No port a membership indicates a GAB problem.
Check the seed number in /etc/gabtab.
If a node is not operational, and the cluster is therefore not seeded,
force GAB to start:
gabconfig -x
If GAB starts and immediately shuts down, check LLT and
private network cabling.
No port h membership indicates a VCS engine (had) startup
problem.

HAD not running:
# gabconfig -a
GAB Port Memberships
========================

GAB and LLT functioning:
# gabconfig -a
GAB Port Memberships
===================================
Port a gen 24110002 membership 01
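The port a and port h checks can be sketched in Python against captured `gabconfig -a` output. The sketch assumes membership lines of the form `Port a gen 24110002 membership 01`, as in the example output; the function names and the returned messages are hypothetical.

```python
import re
from typing import Dict

# Assumed line layout: Port <letter> gen <generation> membership <members>
PORT_RE = re.compile(r"^Port\s+(\S)\s+gen\s+(\S+)\s+membership\s+(\S+)")


def port_memberships(output: str) -> Dict[str, str]:
    """Map each port letter to its membership string."""
    ports = {}
    for line in output.splitlines():
        match = PORT_RE.match(line.strip())
        if match:
            ports[match.group(1)] = match.group(3)
    return ports


def diagnose(output: str) -> str:
    """Apply the two checks from this slide: port a for GAB, port h for had."""
    ports = port_memberships(output)
    if "a" not in ports:
        return "GAB problem: no port a membership"
    if "h" not in ports:
        return "had startup problem: no port h membership"
    return "port a and port h memberships present"
```

For example, output that lists only port a leads to the had startup branch, which matches the guidance that a missing port h points at the VCS engine rather than at GAB.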

VCS Engine Startup Problems


Check the VCS engine (HAD) by running hastatus -sum:
Check GAB and LLT if you see this message:
Cannot connect to server -- Retry Later
Verify that the main.cf file is valid and that system names
match llthosts and llttab:
hacf -verify /etc/VRTSvcs/conf/config
Check for systems in WAIT states:
STALE_ADMIN_WAIT: The system has a stale configuration
and no other system is in a RUNNING state.
ADMIN_WAIT: The system cannot build or obtain a valid
configuration.

STALE_ADMIN_WAIT
To recover from the STALE_ADMIN_WAIT state:
1. Visually inspect the main.cf file to determine whether it is valid.
2. Edit the main.cf file, if necessary.
3. Verify the syntax of main.cf, if modified:
hacf -verify config_dir
4. Start VCS on the system with the valid main.cf file:
hasys -force system_name
5. All other systems perform a remote build from the system now running.

ADMIN_WAIT
A system can be in the ADMIN_WAIT state under
these circumstances:
A .stale flag exists and the main.cf file has a
syntax problem.
A disk error occurs affecting main.cf during a
local build.
The system is performing a remote build and the last
running system fails.

Restore main.cf and use the procedure for STALE_ADMIN_WAIT.

Identifying Other Problems


After verifying that HAD, LLT, and GAB are
functioning properly, run hastatus -sum to
identify problems in other areas:
Service groups
Resources
Agents and resource types


Service Group Problems: Group Not Configured to Start or Run

Service group not onlined automatically when VCS starts:
Check AutoStart and AutoStartList attributes:
hagrp -display service_group
Service group not configured to run on the
system:
Check the SystemList attribute.
Verify that the system name is included.
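The attribute checks above can be sketched in Python, assuming the values have already been copied by hand from `hagrp -display` output into a dictionary. The dictionary layout, the function name, and the sample values are hypothetical and serve only to make the logic concrete.

```python
from typing import Any, Dict, List


def start_problems(attrs: Dict[str, Any], system: str) -> List[str]:
    """List the reasons a group would not start or run on the given system."""
    problems = []
    if system not in attrs.get("SystemList", []):
        problems.append(f"{system} is not in SystemList")
    if attrs.get("AutoStart") in (0, "0"):
        problems.append("AutoStart is disabled")
    if system not in attrs.get("AutoStartList", []):
        problems.append(f"{system} is not in AutoStartList")
    return problems


# Hypothetical attribute values for one service group.
group = {
    "SystemList": ["train11", "train12"],
    "AutoStart": 1,
    "AutoStartList": ["train11"],
}
print(start_problems(group, "train12"))
```

In this sample the group can run on train12 but is not started there automatically, because train12 is missing from AutoStartList.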


Service Group AutoDisabled


Autodisable occurs when:
GAB sees a system but HAD is not running on the system.
Resources of the service group are not fully probed on all
systems in the SystemList.
A particular system is visible through disk heartbeat only.

Make sure that the service group is offline on all
systems in the SystemList attribute.
Clear the AutoDisabled attribute:
hagrp -autoenable service_group -sys system
Bring the service group online.


Service Group Not Fully Probed


Usually a result of improperly configured resource
attributes:
Check ProbesPending attribute:
hagrp -display service_group
Check which resources are not probed:
hastatus -sum
Check the Probed attribute for resources:
hares -display
To probe resources:
hares -probe resource -sys system


Service Group Frozen


Verify value of Frozen and TFrozen attributes:
hagrp -display service_group
Unfreeze the service group:
hagrp -unfreeze group [-persistent]
If you freeze persistently, you must unfreeze
persistently.


Service Group Is Not Offline Elsewhere

Determine which resources are online/offline:
hastatus -sum
Verify the State attribute:
hagrp -display service_group
Offline the group on the other system:
hagrp -offline
Flush the service group:
hagrp -flush service_group -sys system

Service Group Waiting for Resource


Review the IState attribute of all resources to determine
which resource is waiting to go online.
Use hastatus to identify the resource.
Make sure the resource is offline (at the operating
system level).
Clear the internal state of the service group:
hagrp -flush service_group -sys system
Bring all other resources in the service group offline and
try to bring these resources online on another system.
Verify that the resource works properly outside VCS.
Check for errors in attribute values.

Incorrect Local Name


A service group cannot be brought online if the system
name is inconsistent in the llthosts, llttab, or main.cf
files.
1. Check each file for consistent use of system names.
2. Correct any discrepancies.
3. If main.cf is changed, stop and restart VCS.
4. If llthosts or llttab is changed:
a. Stop VCS, GAB, and LLT.
b. Restart LLT, GAB, and VCS.
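Step 1 can be sketched in Python by comparing the system names found in each file. The sketch assumes llthosts lines of the form `node_id system_name` and main.cf system definitions that begin with the keyword `system` followed by the name; both layouts, and all function names, are assumptions made for illustration. It also checks the single name in the sysname file, which this lesson says must match llthosts.

```python
import re
from typing import List, Set


def llthosts_names(text: str) -> Set[str]:
    """Collect system names from '<node_id> <system_name>' lines."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            names.add(fields[1])
    return names


def maincf_names(text: str) -> Set[str]:
    """Collect names from lines that start with the 'system' keyword (assumed syntax)."""
    return set(re.findall(r"^\s*system\s+([\w.-]+)", text, re.MULTILINE))


def name_problems(llthosts: str, maincf: str, sysname: str) -> List[str]:
    """Report system names that do not appear consistently across the files."""
    llt, cf, local = llthosts_names(llthosts), maincf_names(maincf), sysname.strip()
    problems = [f"{n} is in llthosts but not main.cf" for n in sorted(llt - cf)]
    problems += [f"{n} is in main.cf but not llthosts" for n in sorted(cf - llt)]
    if local not in llt:
        problems.append(f"sysname entry {local!r} is not in llthosts")
    return problems
```

An empty result means the names agree; any reported discrepancy is then corrected by hand, followed by the restart steps listed above.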


Concurrency Violations
Occurs when a failover service group is online or
partially online on more than one system
Notification provided by the Violation trigger:
Invoked on the system that caused the concurrency violation
Notifies the administrator and takes the service group offline
on the system causing the violation
Configured by default with the violation script in
/opt/VRTSvcs/bin/triggers
Can be customized:
Send message to the system log.
Display warning on all cluster systems.
Send e-mail messages.

Service Group Waiting for Resource to Go Offline

Identify which resource is not offline:
hastatus -summary

Check logs.
Manually bring the resource offline, if necessary.
Configure ResNotOff trigger for notification or
action.


Resource Problems: Unable to Bring Resources Online

Possible causes of failure while bringing
resources online:
Waiting for child resources
Stuck in a WAIT state
Agent not running


Problems Bringing Resources Offline

Waiting for parent resources to come offline
Waiting for a resource to respond
Agent not running


Critical Resource Faults


Determine which critical resource has faulted:
hastatus -summary
Make sure that the resource is offline.
Examine the engine log.
Fix the problem.
Verify that the resources work properly outside
of VCS.
Clear the fault in VCS.


Clearing Faults

After external problems are fixed:
1. Clear any faults on nonpersistent resources:
hares -clear resource -sys system
2. Check attribute fields for incorrect or missing data.

If the service group is partially online:
1. Flush wait states:
hagrp -flush service_group -sys system
2. Bring resources offline before bringing them online.

Agent Problems: Agent Not Running

Determine whether the agent for that resource is
FAULTED:
hastatus -summary
Use the ps command to verify that the agent
process is not running.
Check the log files for:
Incorrect pathname for the agent binary
Incorrect agent name
Corrupt agent binary

Resource Type Problems


A corrupted type definition can cause agents to
fail by passing invalid arguments.
Verify that the agent works properly outside of
VCS.
Verify values for the ArgList and ArgListValues type
attributes:
hatype -display res_type
Restart the agent after making changes:
haagent -start res_type -sys system


Planning for Disaster Recovery


Back up key VCS files:
types.cf and customized types files

main.cf
main.cmd
sysname
LLT and GAB configuration files
Customized trigger scripts
Customized agents

Use hagetcf to create an archive.
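Independently of hagetcf, a plain compressed tar archive of the key files can be sketched in Python as follows. The path list uses only locations named in this lesson; main.cmd, customized agents, and any site-specific files are left out because their locations are not given here, so the list is an assumption to adapt. The function name `backup` is hypothetical.

```python
import os
import tarfile
from typing import Iterable, List

# Locations named in this lesson; extend for main.cmd, custom agents, and so on.
DEFAULT_PATHS = [
    "/etc/VRTSvcs/conf/config",   # configuration directory used with hacf -verify
    "/etc/VRTSvcs/conf/sysname",
    "/etc/llttab",
    "/etc/llthosts",
    "/etc/gabtab",
    "/opt/VRTSvcs/bin/triggers",  # trigger scripts
]


def backup(paths: Iterable[str], archive: str) -> List[str]:
    """Write the existing paths to a gzip-compressed tar file and return what was saved."""
    saved = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):
                tar.add(path)  # directories are added recursively
                saved.append(path)
    return saved
```

Paths that do not exist are skipped rather than treated as errors, so comparing the returned list with the requested list shows what was missing from the backup.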



The hagetcf Utility


# hagetcf
Saving 0.13 MB
Enter path where configuration can be saved (default is /tmp):
Collecting package info
Checking VCS package integrity
Collecting VCS information
Collecting system configuration
..
Compressing /tmp/vcsconf.train12.tar to
/tmp/vcsconf.train12.tar.gz
Done. Please e-mail /tmp/vcsconf.train12.tar.gz to your
support provider.

Summary
You should now be able to:
Monitor system and cluster status.
Apply troubleshooting techniques in a VCS
environment.
Resolve communication problems.
Identify and solve VCS engine problems.
Correct service group problems.
Resolve problems with resources.
Solve problems with agents.
Correct resource type problems.
Plan for disaster recovery.

Lab Exercise 14
Your instructor will run scripts that cause
problems within your cluster environment.
Apply the troubleshooting techniques provided
in the lesson to identify and fix the problems.
Notify your instructor when you have restored
your cluster to a functional state.
