
Erik Peterson

RAC Development
You bought RAC, now how do you get HA?
HA Architecture
HA Outside the Database
Rolling Upgrades
Best Practices
Oracle's Integrated HA Solutions
Unplanned Downtime: System Failures, Data Failures
Planned Downtime: System Changes, Data Changes
Real Application Clusters
ASM
Flashback
RMAN & Oracle Secure Backup
H.A.R.D. (Hardware Assisted Resilient Data)
Data Guard
Streams
Online Reconfiguration
Rolling Upgrades
Online Redefinition
Oracle MAA Best Practices
Data Guard + RAC Configuration
Data Guard + RAC: end-to-end Data Protection and HA
Basis of Maximum Availability Architecture
Managed as a single configuration
[Diagram: a RAC primary database at the primary site and a RAC standby database at the standby site, linked by Data Guard and managed by the Broker]
HA Architecture Components
Redundant middle or application tier
Redundant network & interconnect infrastructure
Redundant storage infrastructure
Real Application Clusters (RAC) to protect from host and instance failures
Data Guard (DG) to protect from human errors and data failures
Sound operational practices
Oracle RAC HA Architecture
[Diagram: users reach clustered database instances with a shared cache through application servers and the network (hub or switch fabric); the instances communicate over a high-speed, low-latency interconnect (VIA or proprietary) and share a mirrored disk subsystem over a storage area network, overseen by a centralized management console. No single point of failure; drive and exploit industry advances in clustering.]
Keeping non-Oracle components HA
Making Applications Resilient
HA Outside the Database
Oracle Clusterware Provides Application HA
Agent Framework
[Diagram: the framework tells the agent "Please start", asks "How are you?", and tells it "Please stop"; these correspond to the agent's start, check, and stop entry points]
An example of an Agent used by the Framework
An Agent to protect an Apache Web Server
The start command would invoke the apache command
apachectl -k start
perhaps with a -f parameter to locate the configuration file on shared disk
The check command
There are a number of things that could be checked:
- Is the httpd process running? (cheating)
- Can I request a web page? (mostly programmatic)
The stop command would invoke the apache command
apachectl -k stop
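To make the start, check, and stop commands above concrete, here is a minimal, purely illustrative sketch of an action program. Oracle Clusterware only needs an executable that accepts start, stop and check arguments and returns exit code 0 on success; real agents are usually shell scripts, and the class name, port and check method below are assumptions, not Oracle-supplied code.

import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical action program for an Apache resource. Clusterware invokes it
// with "start", "stop" or "check" and treats exit code 0 as success.
public class ApacheAction {
    public static void main(String[] args) throws Exception {
        String cmd = args.length > 0 ? args[0] : "";
        if ("start".equals(cmd)) {
            // Equivalent of: apachectl -k start (optionally -f <conf on shared disk>)
            System.exit(run("apachectl", "-k", "start"));
        } else if ("stop".equals(cmd)) {
            System.exit(run("apachectl", "-k", "stop"));
        } else if ("check".equals(cmd)) {
            // "Can I request a web page?" style check
            System.exit(canFetchPage("http://localhost:80/") ? 0 : 1);
        } else {
            System.err.println("usage: ApacheAction start|stop|check");
            System.exit(2);
        }
    }

    private static int run(String... command) throws Exception {
        // Run the command and hand its exit code back to Clusterware.
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    private static boolean canFetchPage(String url) {
        try {
            HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
            c.setConnectTimeout(5000);
            c.setReadTimeout(5000);
            return c.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }
}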
Oracle Clusterware Agents
xclock
Apache
OC4J
SAP Agent
Application VIP
For more details see "Using Oracle Clusterware to Protect 3rd Party Applications" on OTN
Making Applications Resilient
1. Implement Services
2. Use a FAN- and LBA-aware connection pool
3. Add FAN callouts for cleanup and failback (a sketch of a callout follows below)
4. Modify the application to retry
Gains: manages priorities, visibility of use, can turn off reporting during aggregation
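For step 3, a server-side FAN callout is simply an executable placed in the Clusterware callout directory; it is invoked with the FAN event payload as name=value arguments. The sketch below is only an illustration (the class name, log path and example payload are assumptions): it just records the event, whereas a real callout might stop batch feeds on a down event or trigger failback on an up event.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Date;

// Hypothetical FAN callout: Clusterware passes the event payload as
// name=value arguments (service, instance, status, and so on).
public class FanCallout {
    public static void main(String[] args) throws Exception {
        try (PrintWriter log = new PrintWriter(new FileWriter("/tmp/fan_callout.log", true))) {
            log.println(new Date() + " " + String.join(" ", args));
            // A real callout would parse the status here and, for example,
            // clean up in-flight batch work on "down" or fail back on "up".
        }
    }
}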
Services by Priority
[Diagram: database services for a DW (Standard Queries, Adhoc Queries, ETL) assigned by priority across Node-1, Node-2 and Node-3]
Making your Application React
Phase:             Connect | SQL issue | Blocked in R/W | Processing last result
Session state:     active | active | wait | wait
TCP/IP detection:  tcp_ip_cinterval | tcp_ip_interval | tcp_ip_keepalive | -
Oracle mechanism:  VIP | VIP | out-of-band event (FAN) | out-of-band event (FAN)
VIP Resources
VIP resources have existed since Oracle Database 10g Release 1. They are only used to fail the VIP over to another node so that a client gets an instant NAK when it tries to connect to the virtual IP. They still exist and operate in the same way in Release 2.
Application VIPs
New resource in Oracle Database 10gR2
Created as functional VIPs which can be used to connect to an application regardless of the node it is running on
The VIP is a dependent resource of the user-registered application
There can be many VIPs, one per user application
Fast Connection Failover
FAN & Oracle Connection Pools
[Diagram: connections C1-C9 in the connection pool, spread across Instance 1, Instance 2 and Instance 3]
Fast Connection Failover
Node Leaves
[Diagram, two frames: the node hosting Instance 3 leaves the cluster; a FAN down event is delivered to the connection pool, which cleans up the connections that had been established to the failed instance]
Fast Connection Failover
End State
[Diagram: the pool is left with connections C1-C6 spread across the two surviving instances]
Fast Connection Failover
Instance Join
[Diagram: Instance 3 rejoins the cluster; a FAN up event lets the connection pool create new connections so the service is again spread across all RAC instances]
Currently limited to the Oracle JDBC connection pool
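To make the pool in these diagrams concrete, here is a hedged sketch of a FAN/FCF-enabled Oracle JDBC datasource using the 10g-era implicit connection cache (later releases would typically use the Universal Connection Pool instead). The URL, ONS node list, cache name and credentials are placeholders, not values from this presentation.

import java.sql.Connection;
import java.sql.SQLException;
import oracle.jdbc.pool.OracleDataSource;

public class FcfPool {
    private static final OracleDataSource ODS = create();

    private static OracleDataSource create() {
        try {
            OracleDataSource ods = new OracleDataSource();
            ods.setURL("jdbc:oracle:thin:@//rac-vip1:1521/CRM");   // placeholder host/service
            ods.setUser("app");
            ods.setPassword("app_password");
            // Implicit connection cache + Fast Connection Failover: the cache
            // subscribes to FAN events, cleans up connections to a failed
            // instance and grows again when an instance joins.
            ods.setConnectionCachingEnabled(true);
            ods.setConnectionCacheName("crmCache");
            ods.setFastConnectionFailoverEnabled(true);
            // Remote ONS daemons on the cluster nodes deliver the FAN events.
            ods.setONSConfiguration("nodes=racnode1:6200,racnode2:6200");
            return ods;
        } catch (SQLException e) {
            throw new IllegalStateException("datasource setup failed", e);
        }
    }

    public static Connection borrow() throws SQLException {
        return ODS.getConnection();   // hands out a connection from the cache
    }
}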
Application Example
[Diagram: clients reach 10g services on RAC and ASM through application servers and connection caches; numbered steps show a FAN event arriving while work is in flight, the application receiving an error, retrying, committing, and finally directing work to DR when response time exceeds the service level]
Runtime Connection Load Balancing
Solves the Connection Pool problem!
Easiest way to take advantage of the Load Balancing Advisory
No application changes required
No extra-charge software to buy
Enabled by a parameter on the datasource definition
Supported by JDBC and ODP.NET
Runtime Connection Load Balancing
The client connection pool is integrated with the RAC Load Balancing Advisory
When the application does getConnection, the connection handed back is the one that will provide the best service
The policy is defined by setting a GOAL on the Service
Connection Load Balancing also needs to be in place
Load Balancing Advisory Goals
THROUGHPUT: work requests are directed based on throughput.
Used when the work in a service completes at homogeneous rates. An example is a trading system where work requests are of similar length.
SERVICE_TIME: work requests are directed based on response time.
Used when the work in a service completes at varying rates. An example is an internet shopping system where work requests are of varying length.
NONE: default setting; turns the advisory off.
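The GOAL lives on the service definition in the database. On RAC it is normally set with srvctl, DBCA or Enterprise Manager; purely as a hedged sketch, it can also be set through the DBMS_SERVICE package from a JDBC session, as below. The connection and the service name passed in are assumptions for illustration.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public class SetServiceGoal {
    // Sets SERVICE_TIME as the Load Balancing Advisory goal for a service
    // (CLB_GOAL_SHORT is a common companion setting for connection load balancing).
    public static void apply(Connection conn, String serviceName) throws SQLException {
        String plsql =
            "begin dbms_service.modify_service(" +
            "  service_name => ?, " +
            "  goal         => dbms_service.goal_service_time, " +
            "  clb_goal     => dbms_service.clb_goal_short); end;";
        try (CallableStatement cs = conn.prepareCall(plsql)) {
            cs.setString(1, serviceName);   // e.g. "CRM" (assumed name)
            cs.execute();
        }
    }
}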
Runtime Connection Load Balancing with JDBC, ODP.NET
[Diagram: the application asks the connection cache for a CRM connection; the advisory reports Instance 1 bored, Instance 2 busy, Instance 3 very busy, so connection requests are distributed unevenly across the instances (30% / 60% / 10%) according to load]
TAF or FCF
FCF is faster
Applications that retry deal with the transactional case (a retry sketch follows below)
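Below is a hedged sketch of that retry pattern: if a FAN-driven failure invalidates the connection mid-transaction, the work is rolled back where possible, the connection is discarded back to the pool, and the whole unit of work is replayed on a fresh connection. The doWork placeholder and the retry count are assumptions.

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class RetryingWorker {
    private static final int MAX_ATTEMPTS = 3;   // assumed policy

    // Replays the whole transactional unit of work on a fresh pooled
    // connection if a failure (e.g. FCF closing a dead connection) occurs.
    public static void runWithRetry(DataSource pool) throws SQLException {
        SQLException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            Connection conn = null;
            try {
                conn = pool.getConnection();
                conn.setAutoCommit(false);
                doWork(conn);                 // hypothetical business logic
                conn.commit();
                return;                       // success
            } catch (SQLException e) {
                last = e;
                if (conn != null) {
                    try { conn.rollback(); } catch (SQLException ignore) { }
                }
            } finally {
                if (conn != null) {
                    // close() hands the connection back to the pool/cache
                    try { conn.close(); } catch (SQLException ignore) { }
                }
            }
        }
        throw last;                           // give up after MAX_ATTEMPTS
    }

    private static void doWork(Connection conn) throws SQLException {
        // Placeholder for the application's transactional statements.
    }
}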
Rolling Upgrades
Application: Discrete Components
OS, Hardware Changes: Full
Clusterware Changes: Full
ASM, DB: Limited
Storage Upgrades: Full
Services Provide Application Independence
[Diagram: services (DW, OLTP 1, OLTP 2, OLTP 3, OLTP 4, Batch Reporting) spread across Node-1 through Node-6]
One service is brought offline for upgrade while the others are still available
Rolling Patch Upgrade using RAC
Applies to: Oracle Patch Upgrades, Operating System Upgrades, Hardware Upgrades, Oracle Clusterware Upgrades
[Diagram, four steps: 1. Initial RAC configuration with clients connected to both nodes A and B; 2. Clients on A, patch B; 3. Clients on B, patch A; 4. Upgrade complete, with clients again spread across A and B]
SQL Apply Rolling Database Upgrades
Applies to: Major Release Upgrades, Patch Set Upgrades, Cluster Software & Hardware Upgrades
[Diagram, four steps: 1. Initial SQL Apply configuration: clients on A, redo shipped from A (version X) to B (version X); 2. Upgrade node B to X+1 while redo logs queue on A; 3. Run in mixed mode to test, with A at version X shipping redo to B at X+1; 4. Switchover to B, upgrade A; both databases end at X+1]
Storage Migration Without Downtime
Migration from existing storage to new storage can be done by ASM with less complexity and no downtime
Reduce the cost of disk as data becomes less active (ILM)
[Diagram sequence: 1. New storage is added to the ASM disk group; 2. Automatic online rebalance migrates the data online to the new storage, and dropping the old disks allows the migration to complete; 3. When the rebalance completes, the old disks are empty and are removed from the disk group]
More Nodes
Common Setup
RAC Setup Recommendations
Verifying Setup
Keeping your System Up
Best Practices
More Nodes = Greater HA
[Diagram: the services Email, Payroll, OE and CRM spread across the nodes of a larger cluster]
Node failure has less impact
Go with the Most Common Setup
Source: Oracle RAC Development's customer database of 10g RAC implementations
RAC Setup Recommendations
Full redundancy: networks, switches, etc.
Bond/team your networks and SAN paths
Use a switch
Mirror the OCR
Set up 3 voting disks
Verifying Setup
Use the Cluster Verification Utility (cluvfy) at each stage of the installation:
User sets up the hardware, network & storage: cluvfy stage -post hwos
Sets up OCFS (optional): -pre cfs / -post cfs
Installs CRS: -pre crsinst / -post crsinst
Installs RAC: -pre dbinst
Configures the RAC DB: -pre dbcfg
Verifying Setup
CVU List of Tests
$> ./cluvfy comp -list
Valid components are:
nodereach : checks reachability between nodes
nodecon : checks node connectivity
cfs : checks CFS integrity
ssa : checks shared storage accessibility
space : checks space availability
sys : checks minimum system requirements
clu : checks cluster integrity
clumgr : checks cluster manager integrity
ocr : checks OCR integrity
crs : checks CRS integrity
nodeapp : checks node applications existence
admprv : checks administrative privileges
peer : compares properties with peers
Verifying Setup: Destructive Testing
[Diagram: two servers, each running an ASM instance and a DB instance, attached to shared storage]
Crash of a client connection
Crash of one element of the interconnect
Crash of the SAN connection
Crash of the DB instance
Crash of the ASM instance
Crash of a storage system
Crash of each fiber card
Keeping your System Up
Adhere to strong Systems Life Cycle disciplines:
Comprehensive test plans (functional and stress)
Rehearsed production migration plan
Change Control
Separate environments for Dev, Test, QA/UAT, Production
Backup and recovery procedures
Security controls
Support Procedures
Keeping your System Up
Test Clusters
When? Any application, parameter or component change
Why? Doesn't Oracle / the OS vendor / the HW vendor test? Not in your unique environment
What happens if you don't?
Keeping your System Up
Change Management
Plan changes to minimize downtime and service disruption
May mean overnight or weekend work
Avoid critical periods in application cycles, such as month-end or year-end processing
Consider staged changes
One function at a time
One node at a time (if possible)
Include time and resources to back out changes if necessary
Summary
To build a full HA architecture with Real Application Clusters:
Use Services & FAN
Use the full stack: Oracle Clusterware, ASM
Validate your environment with every change
Follow MAA Best Practices
Have a Test Cluster
