
Supporting SUSE Linux Enterprise High Availability Extension 11
Support and Troubleshooting
Lars Marowsky-Brée
Architect Storage and High-Availability
lmb@novell.com

Novell, Inc. All rights reserved.

Agenda
- Introduction
- Summary of Cluster Architecture
- Common Configuration Issues
- Gathering Cluster-wide Support Information
- Exploring Effects of Cluster Events
- Self-written Resource Agents
- Understanding Log Files

Introduction

SUSE Linux Enterprise Family
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Desktop
- SUSE Linux Enterprise Point of Service
- Extensions
  - SUSE Linux Enterprise Real Time
  - SUSE Linux High Availability Extension
  - SUSE Linux Enterprise Mono Extension

Data Center Challenges
- Minimize unplanned downtime
- Ensure quality of service
- Contain costs
- Utilize resources
- Effectively manage multiple vendors
- Minimize risk

SUSE Linux Enterprise High Availability Extension: Value Proposition
- An integrated suite of robust open source clustering technologies that implement highly available physical and virtual services on Linux.
- Used with SUSE Linux Enterprise Server, it helps to maintain business continuity, protect data, and reduce unplanned downtime for all mission critical Linux workloads.
- Used with virtualization, it adds workload based availability and reliability.

SUSE Linux Enterprise High Availability Extension: Benefits
- Meet service-level agreements
- Continuous access to systems and data
- Maintain data integrity
- Scale-out infrastructure

SUSE Linux Enterprise High Availability Extension: Key Features
- Service Availability 24/7
  - Policy driven clustering
    - OpenAIS messaging and membership layer
    - Pacemaker cluster resource manager
- Sharing and Scaling Data-access by Multiple Nodes
  - Cluster file system
    - OCFS2
    - Clustered logical volume manager
- Disaster Tolerance
  - Data replication via IP
    - Distributed replicated block device
- Scale Network Services
  - IP load-balancing
- User-friendly Tools
  - Graphical user interface
  - Unified command line interface

SUSE Linux Enterprise High Availability Extension: HA Stack from 10 to 11
[Diagram: component stack across three releases. The grouping below is reconstructed from the label order.]
- SLES 10 (part of SLES 10): OCFS2 / EVMS2, DRBD 0.7, YaST2-HB, Heartbeat
- SLE HA 11 (added in SLE HA 11): openAIS, YaST2-Multipath, Pacemaker, OCFS2 general FS, HA GUI, Unified CLI, YaST2-DRBD
- SLE HA 11 SP1 (added in SLE HA 11 SP1): Enhanced Data Replication, Web GUI, Samba Cluster, Metro-Area Cluster, Cluster Config Synchronization, Storage Quorum Coverage, Node Recovery

SUSE Linux Enterprise High Availability Extension: Key Features in Service Pack 1
- Web GUI - Cross platform management
- Storage Based Quorum Coverage - Storage device as a quorum instance
- Integrated Samba Clustering - Integration of Samba with OCFS2 for higher throughput and scale out
- Metro-Area Clusters - Clustering between different data center locations
- Cluster-concurrent RAID1 - Improved resilience
- Enhanced Data Replication - DRBD with Linbit cooperation
- Node Recovery - ReaR to recover server nodes
- GFS2 Migration Support - Read-only access to GFS2 for migration

SUSE Linux Enterprise High Availability Extension: Pricing
- x86 and x86_64
  - USD 699 per year per server
  - Support level inherited from base SUSE Linux Enterprise Server
- Power, Itanium, System z
  - Bundled with SUSE Linux Enterprise Server
  - Support level inherited from base SUSE Linux Enterprise Server

SUSE Linux Enterprise High Availability Extension: Promotion
Existing Customers
- Free of charge subscription
  - For all valid SUSE Linux Enterprise Server subscriptions
  - Effective date: June 1st, 2009
  - Valid for subsequent subscription periods if base SUSE Linux Enterprise Server is renewed on time

SUSE Linux Enterprise High Availability Extension: Competitive Landscape
[Comparison table; the cell values were not preserved in extraction.]
Rows: Upper Node Limit; System Recovery; Disk Mirroring; Platform Support; HW Support; Storage Support; IS Support; GUI; Command line; Monitoring; Documentation; Network Load-Balancing; Setup, Installation and Configuration
Columns: HP (HP-SG); IBM (HACMP); Veritas (VCS); MSFT (Cluster); Steeleye (Lifekeeper); RHAT (Ad. Plat.); Novell (SLES 10); Novell (SLE HA 11); Novell (SLE HA 11 SP1)
Legend: Area with enhancements in SP1

SUSE Linux Enterprise High Availability Extension: Customer Examples
- DFS Deutsche Flugsicherung - government-owned German Air Traffic Control
  Ensures the availability of critical air traffic control services by implementing a fail-over solution using clusters of SUSE Linux Enterprise Servers.
- Getronics - the largest provider of IT services in the Netherlands
  Implemented a cost-effective high availability solution for a web-based customer information system supporting two million customers using SUSE Linux Enterprise Server, SAP, Oracle Real Application Clusters, and IBM System x3850 hardware. When the solution detects a failure in one node, it seamlessly recovers all running processes on the remaining node in its cluster.
- La Curacao - one of the top 100 electronics and appliance retailers in the U.S., focusing on the Hispanic market
  Implemented SUSE Linux Enterprise Server in a clustered environment on HP ProLiant servers to run their mission critical databases and keep La Curacao's stores running without interruption.
- Unitop - one of the largest producers of anionic surfactant chemicals in India
  Implemented a certified high availability SAP ERP solution, using SUSE Linux Enterprise Server, IBM System x hardware, IBM DB2 information management software, and SAP, for all its business activities and information.

Cluster Architecture

3 Node Cluster Overview
[Diagram: three nodes, each with its own kernel, joined by a common layer of Corosync + openAIS, Pacemaker, DLM, and cLVM2+OCFS2. Workloads shown on top: Xen VM 1, Xen VM 2, and a LAMP stack (Apache, IP, ext3). Also labelled: Network Links, Clients, Storage.]

Detailed View of Components Per Node
[Diagram: per-node component stack. Labels are listed in extraction order; the grouping is approximate.]
- Storage, network and kernel: ext3, XFS, OCFS2; cLVM2; Local Disks; SAN (FC(oE), iSCSI); DRBD; Multipath IO; DLM; SCTP, TCP; UDP multicast; Ethernet; Infiniband; Bonding; Linux Kernel
- Resource Agents and LSB init: SAP, MySQL, libvirt, Xen, Apache, iSCSI, Filesystems, IP address, DRBD, clvmd, ocfs2_controld, dlm_controld
- YaST2; DRBD; OpenAIS; MPIO; LVS
- STONITH and LRM
- Fencing: DRAC, iLO, SBD, ...
- Management: Web GUI, Python GUI, CRM Shell
- Pacemaker: CIB, Policy Engine
- OpenAIS

Why Is This Talk Necessary?
We heard comments:
- Can't you just make the software stack easy to understand?
- Why is a multi-node setup more complicated than a single node?
- Gosh, this is awfully complicated! Why is this stuff so powerful? I don't need those other features!
This session addresses most of these questions.

Design and Architecture Considerations

General Considerations
- Consider the support level requirements of your mission-critical systems.
- Your staff is your key asset!
  - Invest in training, processes, knowledge sharing.
  - A good administrator will provide higher availability than a mediocre cluster setup.
- Get expert help for the initial setup, and
- Write concise operation manuals that make sense at 3am on a Saturday ;-)
- Thoroughly test the cluster regularly.
  - Use a staging system before deploying large changes!

Manage Expectations Properly
- Clustering improves reliability, but does not achieve 100%, ever.
- Clusters are more complex than single nodes.
- Fail-over clusters reduce service outage, but do not eliminate it.
- Clustering broken applications will not fix them.
- Clusters do not replace backups, RAID, or good hardware.

Complexity Versus Reliability
- Every component has a failure probability.
  - Good complexity: Redundant components.
  - Undesirable complexity: chained components.
  - Choke point = single point of failure
  - Also consider: Administrative complexity.
- Use as few components (features) as feasible.
  - Our extensive feature list is not a mandatory checklist for your deployment ;-)
- What is your fall-back in case the cluster breaks?
  - Backups, non-clustered operation
  - Architect your system so that this is feasible!
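The difference between redundant and chained components can be made concrete with a little arithmetic. As a sketch, assume each component i is available with probability a_i and that failures are independent (an assumption; real failures are often correlated, which makes the redundant case less favorable than shown):

```latex
\[
A_{\text{chained}} = \prod_{i=1}^{n} a_i ,
\qquad
A_{\text{redundant}} = 1 - \prod_{i=1}^{n} \left(1 - a_i\right)
\]
% Worked example with two components, each with a_i = 0.99:
%   chained:   0.99 * 0.99       = 0.9801  (worse than either component)
%   redundant: 1 - (0.01 * 0.01) = 0.9999  (better than either component)
```

Every chained component multiplies in another factor below 1, which is why a choke point caps the availability of everything behind it.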

Cluster Size Considerations
- More nodes:
  - Increased absolute redundancy and capacity.
  - Decreased relative redundancy.
  - One cluster = one failure domain.
- Does your work-load scale well to larger node counts?
- Choose odd node counts.
  - 4 and 3 node clusters both lose majority after 2 nodes.
- Question:
  - 5 cheaper servers, or
  - 3 higher quality servers with more capacity each?
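The odd-node-count advice follows from simple majority arithmetic. The sketch below assumes a plain strict-majority rule (more than half of the configured nodes must be up); the function names are illustrative and not part of any cluster tool.

```shell
#!/bin/sh
# Majority (quorum) arithmetic for an N-node cluster.
# Assumption: quorum means a strict majority of the configured nodes.

# Smallest number of nodes that forms a strict majority of $1 nodes.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Number of node failures the cluster can absorb while keeping a majority.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print a small table for common cluster sizes.
for n in 2 3 4 5; do
    echo "$n nodes: majority=$(majority "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Both 3 and 4 nodes tolerate exactly one failure, so the fourth node adds capacity but no additional tolerance for node loss; 2 nodes tolerate none.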
Common Setup Issues

General Software Stack
- Please avoid chasing already solved problems!
- Please apply all available software updates:
  - SUSE Linux Enterprise Server 11
  - SUSE Linux Enterprise High Availability Extension
- Consider migrating to SUSE Linux Enterprise High Availability Extension, if you have not already.
  - Usability, ease of setup, integration are all much improved.
  - SUSE Linux Enterprise Server 10 remains fully supported.

From One to Many Nodes
- Error: Configuration files not identical across nodes.
  - /etc/drbd.conf, /etc/corosync/corosync.conf, /etc/ais/openais.conf, resource-specific configurations ...
- Symptoms: Weird misbehavior, something works on one system but not on others, interoperability issues, and possibly others.
- Solution: Make sure they are synchronized.
  - SUSE Linux Enterprise High Availability Extension 11 SP1 provides "csync2" to do this automatically for you.
    - You can add your own files to this list as needed.
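Where csync2 is not yet in place, drift between nodes can be spotted by comparing checksums. This is a minimal sketch, assuming the per-node copies of a file have already been fetched into local files (for example with scp); the function name and the paths are hypothetical.

```shell
#!/bin/sh
# Report whether several copies of one configuration file are identical.
# Assumption: each argument is a local copy fetched from one node.

# Prints "identical" and returns 0 if all files have the same checksum.
# Prints "DIFFERENT: ..." and returns 1 on the first mismatch.
# Returns 2 if a file cannot be read.
same_config() {
    first=$(cksum < "$1") || return 2
    for f in "$@"; do
        sum=$(cksum < "$f") || return 2
        if [ "$sum" != "$first" ]; then
            echo "DIFFERENT: $f does not match $1"
            return 1
        fi
    done
    echo "identical"
}

# Example call (paths are placeholders):
#   same_config node1/corosync.conf node2/corosync.conf node3/corosync.conf
```

Comparing checksums instead of timestamps avoids false alarms when a file was merely copied again without changes.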

Networking
- Switches must support multicast properly.
- Bonding is preferable to using multiple rings:
  - Reduces complexity
  - Exposes redundancy to all cluster services and clients
- Firewall rules are not your friend.
- Keep firmware on switches up to date!
- Make NIC names identical on all nodes

Fencing (STONITH)
- Error: Not configuring STONITH at all
  - It defaults to enabled, so resource start-up will block and the cluster will simply do nothing. This is for your own protection.
- Warning: Disabling STONITH
  - DLM/OCFS2 will block forever waiting for a fence that is never going to happen.
- Error: Using "external/ssh", "ssh", "null" in production
  - These plug-ins are for testing. They will not work in production!
  - Use a "real" fencing device or external/sbd
- Error: Configuring several power switches in parallel.
- Error: Trying to use external/sbd on DRBD

CIB Configuration Issues
- 2 node clusters cannot have majority with 1 node failed
  - # crm configure property no-quorum-policy=ignore
- Resources are starting up in "random" order or on "wrong" nodes
  - Add required constraints!
- Resources move around when something "unrelated" changes
  - # crm configure property default-resource-stickiness=1000
- # crm_verify -L ; ptest -L -VVVV
  - Will point out some basic issues
We'll get back to that ...

Configuring Cluster Resources
- Symptom: On start of one or more nodes, the cluster restarts resources!
- Cause: Resources under cluster control are also started via the "init" sequence.
  - The cluster "probes" all resources on start-up on a node, and when it finds resources active where they should not be (possibly even more than once in the cluster), the recovery protocol is to stop them all (including all dependencies) and start them cleanly again.
- Solution: Remove them from the "init" sequence.

Setting Resource Time-outs
- Belief: "Shorter time-outs make the cluster respond faster."
- Fact: Too short time-outs cause resource operations to "fail" erroneously, making the cluster unstable and unpredictable.
  - A somewhat too long time-out will cause a fail-over delay;
  - a slightly too short time-out will cause an unnecessary service outage.
- Consider that a loaded cluster node may be slower than during deployment testing.
  - Check "crm_mon -t1" output for the actual run-times of resources.

OCFS2
- Using ocfs2-tools-o2cb (legacy mode)
  - O2CB only works with Oracle RAC; full features of SUSE Linux Enterprise High Availability Extension are only available in combination with Pacemaker
  - # zypper rm ocfs2-tools-o2cb
  - Forget about /etc/ocfs2/cluster.conf, /etc/init.d/ocfs2, /etc/init.d/o2cb and /etc/sysconfig/ocfs2
- Nodes crash on shutdown
  - If you have active ocfs2 mounts, you need to umount before shutdown
  - If openais is part of the boot sequence
    - # insserv openais
- Consider: Do you really need OCFS2?
  - Can your application really run concurrently?

Distributed Replicated Block Device
- Myth: Has no shared state, thus no STONITH needed.
  - Fact: DRBD still needs fencing!
- Active/Active:
  - Does not magically make ext3 or applications concurrency-safe, still can only be mounted once
  - With OCFS2, split-brain is still fatal, as data diverges!
- Active/Passive:
  - Ensures only one side can modify data, added protection.
  - Does not magically make applications crash-safe.
- Issue: Replication traffic during reads.
  - "noatime" mount option.

Storage in General
- Activating non-battery backed caches for performance
  - Causes data corruption.
- iSCSI over unreliable networks.
- Lack of multipath for storage.
- Believing that RAID replaces backups.
  - RAID and replication immediately propagate logical errors!
- Please ensure that device names are identical on all nodes.

Exploring the Effect of Events

What Are Events?
- All state changes to the cluster are events
  - They cause an update of the CIB
  - Configuration changes by the administrator
  - Nodes going up or going down
  - Resource monitoring failures
- Response to events is configured using the CIB policies and computed by the Policy Engine
- This can be simulated using ptest
  - Available comfortably through the "crm" shell

Simulating Node Failure
hex-0:~ # crm
crm(live)# cib new sandbox
INFO: sandbox shadow CIB created
crm(sandbox)# cib cibstatus node hex-0 unclean
crm(sandbox)# ptest

Simulating Node Failure
[Slide image not preserved.]

Simulating Resource Failure
crm(sandbox)# cib cibstatus load live
crm(sandbox)# cib cibstatus op
usage: op <operation> <resource> <exit_code> [<op_status>] [<node>]
crm(sandbox)# cib cibstatus op start dummy1 not_running done hex-0
crm(sandbox)# cib cibstatus op start dummy1 unknown timeout hex-0
crm(sandbox)# configure ptest
ptest[4918]: 2010/02/17_12:44:17 WARN: unpack_rsc_op: Processing failed op dummy1_start_0 on hex-0: unknown error (1)

Simulating Resource Failure
[Slide image not preserved.]

Exploring Configuration Changes
crm(sandbox)# cib cibstatus load live
crm(sandbox)# configure primitive dummy42 ocf:heartbeat:Dummy
crm(sandbox)# ptest

Configuration Changes - Woah!
[Slide image not preserved.]

Exploring Configuration Changes
crm(sandbox)# configure rsc_defaults resource-stickiness=1000
crm(sandbox)# ptest
crm(sandbox)# configure order order-42 inf: dummy42 dummy1
crm(sandbox)# ptest

Configuration Changes - Almost ...
[Slide image not preserved.]

Configuration Changes - Done
[Slide image not preserved.]

Log Files and Their Meaning

hb_report Is the Silver Support Bullet
- Compiles
  - Cluster-wide log files,
  - Package state,
  - DLM/OCFS2 state,
  - System information,
  - CIB history,
  - Parsed core dump reports,
  into a single tarball for all support needs.
- # hb_report -n "node1 node2 node3" -f 12:00 /tmp/hb_report_example1

Logging
- "The cluster generates too many log messages!"
  - Alas, customers are even more upset when asked to reproduce a problem on their production system ;-)
  - Incidentally, all command line invocations are logged.
- System-wide logs: /var/log/messages
- CIB history: /var/lib/pengine/*
  - All cluster events are logged here and can be analyzed with hindsight (python GUI, ptest, and the crm shell).

Where Is the Real Cause?
- The answer is always in the logs.
- Even though the logs on the DC may print a reference to the error, the real cause may be on another node.
- Most errors are caused by resource agent misconfiguration.

Correlating Messages to Their Cause
- Feb 17 13:06:57 hex-8 pengine: [7717]: WARN: unpack_rsc_op: Processing failed op ocfs2-1:2_monitor_20000 on hex-0: not running (7)
  - This is not the failure, just the Policy Engine reporting on the CIB state! The real messages are on hex-0, grep for the operation key:
- Feb 17 13:06:57 hex-0 Filesystem[24825]: [24861]: INFO: /filer is unmounted (stopped)
- Feb 17 13:06:57 hex-0 crmd: [7334]: info: process_lrm_event: LRM operation ocfs2-1:2_monitor_20000 (call=37, rc=7, cib-update=55, confirmed=false) not running
  - Look for the error messages from the resource agent before the lrmd/pengine lines!
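The search described above can be sketched as a small filter. It assumes syslog-style lines with the host name in the fourth field, as in the messages shown; the function name is hypothetical. For a given node it prints each line containing the operation key, preceded by the line logged just before it on that node, which is where the resource agent's own message tends to be.

```shell
#!/bin/sh
# Show the lines for one operation key on one node, plus the line that
# immediately precedes each match on that node (typically the resource
# agent's own message).
# Assumption: syslog format "Mon DD HH:MM:SS host tag: message".

# usage: real_cause <operation-key> <node> < logfile
real_cause() {
    awk -v key="$1" -v node="$2" '
        $4 != node { next }              # ignore lines from other nodes
        index($0, key) {
            # print the preceding line once, unless it matched already
            if (prev != "" && !index(prev, key)) print prev
            print
        }
        { prev = $0 }                    # remember the last line seen on this node
    '
}

# Example call:
#   real_cause 'ocfs2-1:2_monitor_20000' hex-0 < /var/log/messages
```

Filtering by node first drops the Policy Engine summary from the DC, which only reports the CIB state.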
Debugging Resource Agents

Common Resource Agent Issues
- Operations must succeed if the resource is already in the requested state.
- "monitor" must distinguish between at least "running/OK", "running/failed", and "stopped"
  - Probes deserve special attention
- Meta-data must conform to DTD.
- 3rd party resource agents do not belong under /usr/lib/ocf/resource.d/heartbeat - choose your own provider name!
- Use ocf-tester to validate your resource agent.
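The first two points can be illustrated with a stripped-down agent skeleton. This is a sketch only: the "resource" is just a state file, the function names and the STATE_FILE variable are made up for illustration, and a real agent additionally needs meta-data and validate-all actions and proper parameter handling. The exit codes 0, 1 and 7 are the standard OCF values for success, generic error and not running.

```shell
#!/bin/sh
# Sketch of idempotent start/stop and a three-state monitor.
# Assumption: the resource counts as "running/OK" when $STATE_FILE exists
# and contains "ok", as "running/failed" when it exists with any other
# content, and as "stopped" when it does not exist.

OCF_SUCCESS=0        # action succeeded / resource is running
OCF_ERR_GENERIC=1    # generic failure, used here for running/failed
OCF_NOT_RUNNING=7    # resource is cleanly stopped

STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/example-ra.state}"

ra_monitor() {
    [ -e "$STATE_FILE" ] || return $OCF_NOT_RUNNING
    [ "$(cat "$STATE_FILE")" = "ok" ] || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

ra_start() {
    # Starting an already running resource must still report success.
    if ra_monitor; then
        return $OCF_SUCCESS
    fi
    echo ok > "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

ra_stop() {
    rc=0
    ra_monitor || rc=$?
    # Stopping an already stopped resource must still report success.
    if [ "$rc" -eq "$OCF_NOT_RUNNING" ]; then
        return $OCF_SUCCESS
    fi
    rm -f "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}
```

Note that stop also cleans up a resource in the failed state, so the cluster can recover it with a clean start afterwards.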

ocf-tester Example Output
hex-0:~ # ocf-tester -n Example /usr/lib/ocf/resource.d/bs2010/Dummy
Beginning tests for /usr/lib/ocf/resource.d/bs2010/Dummy...
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* rc=7: Stopping a stopped resource is required to succeed
Tests failed: /usr/lib/ocf/resource.d/bs2010/Dummy failed 1 tests

Questions and Answers
Unpublished Work of Novell, Inc. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary, and trade secret information of Novell, Inc. Access to this work is restricted to Novell employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Novell, Inc. makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for Novell products remains at the sole discretion of Novell. Further, Novell, Inc. reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All Novell marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.