RAC Internals

Understanding RAC Internals

Barb Lundhild
Oracle Corporation
RAC Product Management
The following is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Agenda
1. What are the major components of Oracle Clusterware and
how do they interact?
2. Why does Oracle reboot nodes?
3. How does Oracle handle private interconnect failure and
scalability?
4. When my public network fails, why do ASM and the db
instance get shut down?
5. What exactly is the VIP, its purpose, and how does it
work?
6. What is the purpose of ONS – is it required for anything
other than FAN?
7. How does Oracle do load balancing across RAC
instances?
What are the major components of Oracle Clusterware and how do they interact?
RAC 10g Architecture
[Diagram: Node1 through Node n on the public network. Each node stacks, top to bottom: VIP, Service, Listener, instance, ASM, Oracle Clusterware, Operating System. The nodes are joined by the cluster interconnect and attach to shared storage.]
Shared storage:
• Redo / Archive logs of all instances and Database / Control files: managed by ASM
• OCR and Voting Disks: RAW devices
What does Clusterware provide?
[Diagram: Oracle Clusterware, layered on the Operating System, provides:]
• VIP
• Event Management
• High Availability Framework
• Process Monitor
• Group Membership
Oracle Clusterware 10g Architecture
[Diagram: the Oracle Clusterware components, layered on the Operating System:]
• VIP
• EVM
• RACG
• CRS
• OPROC
• CSS
Why does Oracle Clusterware reboot nodes?
Oracle Clusterware
Group Membership and Heartbeats
• Cluster needs to know who is a member at all times

• Oracle Clusterware has 2 heartbeats:
• Network heartbeat and Disk heartbeat
• If a node does not send a network heartbeat for
<MissCount> (time in seconds), then node is evicted
from cluster
• If disk heartbeat (voting disk) is not updated in <I/O
timeout>, then node is evicted from cluster
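The two eviction rules above can be sketched as a simple check. This is an illustration only, not CSS code: the function name `should_evict`, its parameters, and the timestamps are hypothetical, and the 60 s and 200 s thresholds are just the default values quoted elsewhere in these slides.

```python
def should_evict(now, last_network_hb, last_disk_hb,
                 misscount=60.0, disktimeout=200.0):
    """Return why a node would be evicted, or None if both heartbeats are current.

    All times are in seconds. misscount and disktimeout are assumed defaults.
    """
    if now - last_network_hb > misscount:
        return "network heartbeat missed"
    if now - last_disk_hb > disktimeout:
        return "disk heartbeat missed"
    return None


# A node whose last network heartbeat is 70 s old exceeds a 60 s MissCount.
print(should_evict(now=100.0, last_network_hb=30.0, last_disk_hb=90.0))
```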
Heartbeat Failures
• Network Heartbeat
node(4) missed(59) checkin(s)
>2005-06-18 08:14:37.858 [3002575792]
>WARNING: clssnmPollingThread:
Eviction started for node 4,flags 0x000d,
>state 3, wt4c 0
>2005-06-18 08:14:41.985 [3047074736]
>TRACE: clssnmHandleSync:
• Disk Heartbeat
CSSD]2005-10-11 15:56:23.668 [93645744]
>WARNING: clssnmDiskPMT: long disk latency
>(45940 ms) to voting disk (0//dev/raw/raw1)
Oracle Clusterware
Split Brain Resolution
• Split Brain Resolution:

• Determine surviving subcluster
• Sub-cluster with largest number of Nodes
• Sub-cluster with lowest node number
• IO Fencing via STONITH algorithm (remote power reset)
• Voting disk is used to detect and resolve network
problems that could lead to a split-brain
• Final arbiter of the status of configured nodes, either up or
down, and delivers eviction notices
• Recommended to have at least 3 voting disks
• Multiple voting disks supported in RAC 10g Release 2
• Dynamic addition of voting disks in RAC 11g
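A minimal sketch of the survivor selection listed above, assuming the two criteria are applied in order: the largest sub-cluster wins, and a tie goes to the sub-cluster that contains the lowest node number. The function name and the set-based representation are hypothetical.

```python
def surviving_subcluster(subclusters):
    """Pick the surviving sub-cluster from an iterable of sets of node numbers.

    Assumed ordering: more nodes wins; on a tie, the sub-cluster holding
    the lowest node number wins.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


# A 2/2 split: the half that contains node 1 survives.
print(surviving_subcluster([{2, 3}, {1, 4}]))
```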
Oracle Clusterware
Disk Heartbeat
• Disktimeout: maximum time (s) for voting file I/O to
complete.
• In 10g Release 1 and 10.2.0.1, I/O timeout was directly related to
MissCount.
• i.e., MissCount governed the sensitivity of both heartbeats
• 10.2.0.2: more granular sensitivity via separation of network
and disk heartbeats
• Disktimeout parameter set for CSS, default = 200s
• Tune disktimeout for the Voting Disk storage solution
• Be careful: some multipathing solutions require high
disktimeout values
Changing MissCount
• IT IS NOT SUPPORTED TO REDUCE MISSCOUNT
BELOW THE DEFAULT
• Default varies somewhat by platform (30s or 60s)
• Default = 600s if vendor clusterware is installed
• It should not be necessary to tune Disktimeout
How does Oracle handle private interconnect failure and scalability?
Private Interconnect
[Diagram: Node1, Node 2, /…/, Node n on the public network. Each node stacks VIP, Service, Listener, instance, ASM, Oracle Clusterware, Operating System. The cluster interconnect runs through Switch 1 and Switch 2.]
Private Interconnect
• Network between the nodes of a RAC cluster MUST
be private
• Supported links: GbE, IB ( IPoIB: 10.2 )
• Supported transport protocols:
• Oracle Clusterware uses TCP
• RAC: UDP, RDS (10.2.0.3)
• Use multiple or dual-ported NICs for redundancy and
increase bandwidth with NIC bonding
• Large ( Jumbo ) Frames for GbE recommended
Interconnect Bandwidth
• Bandwidth requirements depend on

• CPU power per cluster node
• Application-driven data access frequency
• Number of nodes and size of the working set
• Data distribution between PQ slaves
• Typical utilization approx. 10-30% in OLTP
• 10000-12000 8K blocks per sec to saturate 1 x Gb Ethernet
( 75-80% of theoretical bandwidth )
• Multiple NICs generally not required for performance
and scalability
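The saturation figure above can be checked with back-of-the-envelope arithmetic. The sketch below counts block payload only and assumes 1 Gb Ethernet means 10^9 bits per second; protocol headers and message overhead are not modelled, so real link utilization is somewhat higher than what it prints. The function name is hypothetical.

```python
BLOCK_BYTES = 8 * 1024             # 8K database block
LINK_BITS_PER_SEC = 1_000_000_000  # assumed: 1 Gb Ethernet = 10^9 bit/s


def payload_utilization(blocks_per_sec, block_bytes=BLOCK_BYTES,
                        link_bps=LINK_BITS_PER_SEC):
    """Fraction of the link consumed by block payload alone."""
    return blocks_per_sec * block_bytes * 8 / link_bps


for rate in (10_000, 12_000):
    share = payload_utilization(rate)
    print(f"{rate} blocks/s -> {share:.1%} of 1 GbE (payload only)")
```

Payload alone comes to roughly two thirds to four fifths of the link for the quoted block rates, which is in the same range as the 75-80% figure once overhead is added.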
IPC configuration
• Settings:
• Socket receive buffers ( 256 KB – 1MB )
• Negotiated top bit rate and full duplex mode
• NIC ring buffers
• Ethernet flow control settings
• CPU(s) receiving network interrupts
• Verify your setup:
• CVU does checking
• Load testing eliminates potential for problems
Interconnect Bonding
• Terminology: NIC Bonding, link aggregation, port
trunking, NIC teaming, …
• Multiple physical links combined into a single logical
link
• Provides redundancy and/or scalability
• Logical link is provided to Oracle Clusterware and
RAC
• Most operate at OSI Layer 2
• Different implementations on different platforms
• Read the fine print
• Generally recommend failover only (active/passive)
configuration
Interconnect Bonding
• Some cluster managers provide support for multiple
interconnects
• Not required with Oracle Clusterware
• OS-Specific bonding
• Solaris: IPMP, Sun Trunking
• AIX: etherchannel
• HP-UX: APA
• Linux: NIC Bonding
• Windows: NIC Teaming
• IB drivers inherently support failover and load balancing.
Interconnect Configuration
• OCR
[SYSTEM.css.interfaces.global.bond0.192|d168|d12|d0.1]
ORATEXT : cluster_interconnect
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION :
PROCR_ALL_ACCESS, OTHER_PERMISSION : PROCR_READ, USER_NAME :
oracle, GROUP_NAME : odba}
• RDBMS
SQL> select * from x$ksxpia;
ADDR           INDX    INST_ID P PICK NAME_KSXPIA     IP_KSXPIA
-------- ---------- ---------- - ---- --------------- -------------
58EC8340          0          1 Y OCR  bond0           192.168.12.1
• cluster_interconnects (init.ora for RAC)

• Overrides clusterware setting
• Supports load balancing, not failover
Operating System Dependency
• Block access latencies increase when CPU(s) are busy
and run queues are long
• Immediate LMS scheduling is critical for predictable
block access latencies when CPU > 80% busy
• Fewer and busier LMS processes may be more
efficient, i.e. monitor their CPU utilization
• Real Time or fixed priority for LMS is supported
• Implemented by default with 10.2
• Do not run more instances on a server than half its number of CPUs
Misconfigured or Faulty Interconnect
Can Cause:
• Dropped packets/fragments
• Buffer overflows
• Packet reassembly failures or timeouts
• Ethernet Flow control kicks in
• TX/RX errors
“lost blocks” at the RDBMS level, responsible for
64% of escalations
“Lost Blocks”: NIC Receive Errors
Db_block_size = 8K
ifconfig -a:
eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04
inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95
TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0
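A small sketch of how the RX counters in output like the above could be pulled out for monitoring. It assumes the older net-tools `ifconfig` layout shown on this slide (newer versions print a different format), and the function name `rx_counters` is hypothetical.

```python
import re

# Matches the RX line of the older ifconfig layout shown on the slide.
RX_LINE = re.compile(
    r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)"
)


def rx_counters(ifconfig_text):
    """Return the RX counters as a dict, or None if no RX line is found."""
    match = RX_LINE.search(ifconfig_text)
    if match is None:
        return None
    names = ("packets", "errors", "dropped", "overruns", "frame")
    return dict(zip(names, map(int, match.groups())))


sample = "RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95"
counters = rx_counters(sample)
if counters["errors"] or counters["frame"]:
    print("NIC receive errors present:", counters)
```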
“Lost Blocks”: IP Packet Reassembly Failures
netstat -s
Ip:
84884742 total packets received
…
1201 fragments dropped after timeout
…
3384 packet reassembles failed
Finding a Problem with the Interconnect or IPC
Top 5 Timed Events
Event               Waits     Time(s)   Avg wait(ms)   %Total Call Time   Wait Class
log file sync       286,038   49,872    174            41.7               Commit
gc buffer busy      177,315   29,021    164            24.3               Cluster
gc cr block busy    110,348   5,703     52             4.8                Cluster
gc cr block lost    4,272     4,953     1159           4.1                Cluster
cr request retry    6,316     4,668     739            3.9                Other
Should never be here
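The report figures can be cross-checked, since the average wait is total time divided by the number of waits. The sketch below recomputes the averages and flags the events assumed to be the target of the "Should never be here" callout, namely `gc cr block lost` and `cr request retry`. That reading of the callout, and all names in the code, are assumptions.

```python
# (event, waits, total wait time in seconds) copied from the report above
EVENTS = [
    ("log file sync", 286_038, 49_872),
    ("gc buffer busy", 177_315, 29_021),
    ("gc cr block busy", 110_348, 5_703),
    ("gc cr block lost", 4_272, 4_953),
    ("cr request retry", 6_316, 4_668),
]

# Assumption: these are the events the slide's callout points at.
SUSPECT = {"gc cr block lost", "cr request retry"}


def avg_wait_ms(waits, time_s):
    """Average wait in milliseconds, rounded as in the report."""
    return round(time_s * 1000 / waits)


def interconnect_suspects(events):
    """Return (event, avg wait ms) for events that hint at lost blocks."""
    return [(name, avg_wait_ms(w, t)) for name, w, t in events if name in SUSPECT]


print(interconnect_suspects(EVENTS))
```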
What are the startup/shutdown sequence and dependencies?
Node Startup Sequence
1 Operating System
2 Oracle Clusterware
3 VIP1
4 ASM
5 Instance 1
6 Listener
7 Service
Oracle Dependencies
Prior to 10.2.0.3
[Diagram: Node1 and Node2 on the public network. Node1 shows VIP1, Service, Listener, instance 1, ASM; Node2 shows VIP2, Service, Listener, instance 2, ASM. Both attach to the cluster and to shared storage managed by ASM.]
Oracle Dependencies
Prior to 10.2.0.3
[Diagram: the same cluster with VIP1 shown twice (VIP1 VIP1 VIP2) and no instances shown; Service, Listener, and ASM appear on both nodes.]
Oracle Dependencies
[Diagram: Node1 with VIP1 and Node 2 with VIP2 on the public network, each with Service, Listener, and ASM; shared storage managed by ASM.]
Oracle Dependencies
[Diagram: the same cluster with VIP1 shown twice (VIP1 VIP1 VIP2); Service, Listener, and ASM appear on both nodes.]
What exactly is the VIP, its purpose, and how does it work?
Why does Oracle RAC 10g have a VIP?
• Protects database clients from long TCP/IP timeouts
(can be >10 minutes)
• During normal operation, works the same as
hostname
• During failure, it removes the network timeout from
connection request time; the client fails over immediately to
the next address in the list
sales.us.acme.com =(DESCRIPTION=(ADDRESS_LIST=
(LOAD_BALANCE=on)(FAILOVER=ON)
(ADDRESS=(PROTOCOL=tcp)(HOST=sales1-vip)(PORT=1521))
(ADDRESS=(PROTOCOL=tcp)(HOST=sales2-vip)(PORT=1521)))
(CONNECT_DATA=
(SERVICE_NAME=sales.us.acme.com)))
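The client behaviour behind an address list like the one above can be sketched as a loop over the addresses. Oracle Net does this internally; the code below only illustrates the idea. The `connect` callable and the function name are hypothetical stand-ins, and the shuffle models LOAD_BALANCE=on as a random choice of starting order.

```python
import random


def connect_with_failover(addresses, connect, load_balance=True):
    """Try each (host, port) until one connects; raise if all of them fail.

    `connect` is a hypothetical callable that returns a connection object
    or raises OSError when the address refuses the connection.
    """
    candidates = list(addresses)
    if load_balance:
        random.shuffle(candidates)  # LOAD_BALANCE=on: randomize the order
    last_error = None
    for host, port in candidates:
        try:
            return connect(host, port)
        except OSError as exc:      # FAILOVER=ON: move to the next address
            last_error = exc
    raise ConnectionError("no address in the list accepted the connection") from last_error


def fake_connect(host, port):
    """Stand-in for a real connect call: sales1-vip answers with an error."""
    if host == "sales1-vip":
        raise ConnectionRefusedError("VIP has relocated, error returned at once")
    return (host, port)


print(connect_with_failover([("sales1-vip", 1521), ("sales2-vip", 1521)], fake_connect))
```

Because the relocated VIP answers with an immediate error instead of letting the request time out, the loop reaches the next address without waiting for a TCP/IP timeout.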
Oracle RAC 10g VIP
The Details!
• One for each node in cluster
• Required for Oracle Clusterware installation
• IP and network name should not currently be in use
• Should be registered in DNS and be on the same
subnet as public IP address
• Can use OS bonding to provide failover and load
balancing on network interfaces on the node
• Configuration managed by VIPCA
• Note that netmask defaults to 255.255.255.0, rather
than defaulting to netmask of underlying physical
interface.
Oracle RAC VIP is DIFFERENT
• Only accepts connections when on its home node
• Failure on home node: relocates to another node in
the cluster only to send an error back to the client (it will
not be in the listener so it cannot accept connections!)
• You will only have one active RAC VIP per node
(there may be others that have relocated due to
failure!)
• Independent of number of databases running in cluster
Oracle RAC 10g VIP
[root@pmrac1 root]# ifconfig

eth0 Link encap:Ethernet HWaddr 00:12:79:D8:90:93
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
collisions:0 txqueuelen:1000
RX bytes:509963813 (486.3 Mb) TX bytes:3621223517 (3453.4 Mb)
Interrupt:25
eth0:1 Link encap:Ethernet HWaddr 00:12:79:D8:90:93   <-- VIP
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
collisions:0 txqueuelen:1000
RX bytes:3400642002 (3243.1 Mb) TX bytes:3166774792 (3020.0 Mb)
Interrupt:25
Listener.ora
SID_LIST_LISTENER_PMRAC1 =
(SID_LIST =
(SID_DESC =
(SID_NAME = PLSExtProc)
(ORACLE_HOME = /u01/oracle/product/10gR2/asm)
(PROGRAM = extproc)
)
)
LISTENER_PMRAC1 =
(DESCRIPTION_LIST =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1))
(ADDRESS = (PROTOCOL = TCP)(HOST = pmrac1-vip)(PORT = 1521)(IP = FIRST))   <-- VIP
(ADDRESS = (PROTOCOL = TCP)(HOST = 144.25.214.45)(PORT = 1521)(IP = FIRST))
)
)
Application VIPs
• New resource in Oracle RAC 10g Release 2

• Created as functional VIPs which can be used to
connect to an application regardless of the node it is
running on
• VIP is a dependent resource of the user registered
application
• There can be many VIPs, one per User Application
Creating an Application VIP
• The usrvip script must run as root
• The default permissions need to be changed
• As root…
crs_setperm ApplicationVIP1 -o root
• Allow oracle user to execute this script
• As root…
crs_setperm ApplicationVIP1 -u user:oracle:r-x
• Start the VIP
• As oracle…
crs_start ApplicationVIP1
What is the purpose of ONS – is it required for anything other than FAN?
Oracle Notification Service (ONS)
• Publish/Subscribe Messaging System

• Allows both local and remote consumption
• Used by Fast Application Notification (FAN) to publish
HA Events and Load Balancing Events
• Used by FAN clients to subscribe to events
• Automatically installed and configured by the
installation of Oracle Clusterware
• DO NOT TURN OFF – Required by Oracle
Clusterware and RAC
What is FAN?
• Fast Application Notification (FAN) is a RAC
notification mechanism
• FAN HA Events: Notification of Up/Down for service,
instance & node
• Load Balancing Advisory Events: Advise clients of
current load for service and where to send
connection requests
• Enable it, and Forget it.
FAN Clients
• HA Events: JDBC Implicit Connection Cache, OCI,
ODP.NET Connection Pools, Listener, Server Side
Callouts, CMAN
• Load Balancing Advisory Events: JDBC Implicit
Connection Cache, ODP.NET Connection Pools,
Listener, CMAN
• New in RAC 11g: OCI Session Pools subscribe to Load
Balancing Advisory Events to provide Runtime Connection
Load Balancing
How does Oracle do load balancing across RAC instances?
Connection Load Balancing
[Diagram: an Application Server asks the LISTENER over the network for service OLTP; the RAC Database offers OLTP1 on N1, OLTP2 on N2, and OLTP3 on N3.]
Connection Load Balancing
[Diagram: Clients reach the Listeners over the network; the LISTENER has the connection made to OLTP1 in the RAC Database.]
Connection Pools
How do you Load Balance?
[Diagram: an Application Connection Pool holding connections (c) spread across Real Application Clusters.]
Load Balancing Advisory
• Load Balancing Advisory is an advisory for balancing
work across RAC instances.
• Load Balances at the transaction level (not
connections!)
• Directs work to where services are executing
well and resources are available.
• Adjusts distribution for different power nodes,
different priority and shape workloads, changing
demand.
• Stops sending work to slow, hung, failed nodes
early.
• Automatic Workload Repository
• Calculates goodness locally, forwards to master
MMON
• Master MMON builds advisory for distribution of work
• Records advice to SYS$SERVICE_METRICS
• Posts FAN event to AQ, PMON, ONS
View LBA FAN Event
Runtime Connection Load Balancing
• When the application does “getConnection”, the
connection given is the one that will provide
the best service.
• Supported by Oracle JDBC and ODP.NET
connection pools (OCI Session Pools in RAC
11g!)
• Policy defined by setting GOAL on Service
• Connection Load Balancing needs to be configured as well
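One way to picture a pool acting on the advisory is a weighted choice among instances. This is a sketch under stated assumptions, not the algorithm used by the Oracle connection pools: the percentages are made-up example values and `pick_instance` is a hypothetical name.

```python
import random


def pick_instance(advice, rng=random):
    """Choose an instance with probability proportional to its advised share.

    `advice` maps instance name to the share of work the advisory
    suggests sending there (hypothetical example values below).
    """
    names = list(advice)
    weights = [advice[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


advice = {"OLTP1": 50, "OLTP2": 35, "OLTP3": 15}
print(pick_instance(advice))
```

An instance whose advised share drops to zero, for example because it is slow or hung, is never chosen by this scheme.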
Enabled through Service Goal
• THROUGHPUT – Work requests are directed based on
throughput.
• Used when the work in a service completes at homogeneous
rates. An example is a trading system where work requests
are similar lengths.
• SERVICE_TIME – Work requests are directed based
on response time.
• Used when the work in a service completes at various rates.
An example is an internet shopping system where work
requests are various lengths.
• NONE – Default setting, turns off the advisory
Fast Connection Failover
• Fast and reliable high availability for connections in an
Oracle Real Application Clusters 10g environment
• Enable it and forget it
• Application can make it transparent to user by
trapping SQL Exception and retrying
• Supported by Oracle JDBC, OCI, and ODP.NET
FAN/FCF Client Integration
JDBC
• When DOWN signal received from RAC 10g

• First pass: Connections are marked as down
• Second pass: Aborts and removes connections that are marked as
down
• Routes new requests to surviving instances
• Throws exception if application was in midst of transaction
• When UP signal received from RAC 10g

• Creates new connections to new instances
• Distributes new work requests evenly to all available instances
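The two-pass handling of a DOWN event described above can be sketched with a toy pool. The dictionary layout and the function name are hypothetical; the real JDBC connection cache is not shown on these slides.

```python
def handle_down_event(pool, instance):
    """Apply a DOWN event for `instance` to a list of connection dicts.

    Returns (survivors, interrupted_ids). Each connection dict is assumed
    to carry "id", "instance" and "in_txn" keys.
    """
    # First pass: mark every connection to the failed instance as down.
    for conn in pool:
        if conn["instance"] == instance:
            conn["down"] = True

    # Second pass: remove marked connections from the pool. Those that were
    # mid-transaction are the ones whose callers would see an exception.
    interrupted = [c["id"] for c in pool if c.get("down") and c["in_txn"]]
    survivors = [c for c in pool if not c.get("down")]
    return survivors, interrupted


pool = [
    {"id": 1, "instance": "OLTP1", "in_txn": True},
    {"id": 2, "instance": "OLTP1", "in_txn": False},
    {"id": 3, "instance": "OLTP2", "in_txn": True},
]
survivors, interrupted = handle_down_event(pool, "OLTP1")
print([c["id"] for c in survivors], interrupted)
```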
Q & A
Appendix
For More Information
http://search.oracle.com
REAL APPLICATION CLUSTERS
or
otn.oracle.com/rac
Useful Metalink Notes
• Note 342082.1 “How to Change Subnet Masks for VIPs”
• Note 294430.1 “CSS Timeout Computation in RAC 10g”
• Note 284752.1 “10g RAC: Steps To Increase CSS Misscount,
Reboottime and Disktimeout”
• Note 291962.1 “Setting Up Bonding in SLES 9”
• Note 291958.1 “Setting Up Bonding in Suse SLES8”
• Note 298891.1 “Configuring Linux for the Oracle 10g VIP using
bonding”
• Note 283107.1 “Configuring Solaris IP Multipathing (IPMP) for
the Oracle 10g VIP”
OTN.ORACLE.COM/RAC
• Workload Management with Oracle Real Application
Clusters (FAN, FCF, Load Balancing)
• Using standard NFS to support a third voting disk on a
stretch cluster configuration on Linux
• Using Oracle Clusterware to Protect 3rd Party
Applications
• RAC Sample Code Page
http://www.oracle.com/technology/sample_code/products/rac/index.html