
Top 4 Reasons for Node Reboot or Node Eviction in a Real Application Cluster (RAC) Environment

Remote DBAs often face node reboots or node evictions in Real Application Cluster environments. A node reboot is performed by CRS to maintain consistency in the cluster by removing a node that is facing some critical issue.

A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung node, or a hung ocssd.bin process. There can be many more reasons for a node eviction, but some are common and repetitive. Here, I am listing the common ones.

Whenever a database administrator faces a node reboot issue, the first things to look at are /var/log/messages and the OS Watcher logs of the database node that was rebooted.

/var/log/messages gives you an actual picture of the reboot: the exact time of the restart, the status of resources such as swap and RAM, and so on.
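
For example, a minimal first pass over the log (the timestamp pattern below is a placeholder for the actual incident time):

# Pull the messages logged in the minutes before the reboot.
grep 'Apr 23 08:1' /var/log/messages

# Search explicitly for memory-pressure markers.
grep -iE 'out of memory|oom-killer|page allocation failure' /var/log/messages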

1. High Load on Database Server: Out of every 100 issues, I have seen 70 to 80 cases where high load on the system was the reason for the node eviction. One common scenario: under high load, the RAM and swap space of the DB node get exhausted, the system stops working, and it finally reboots.

So, every time you see a node eviction, start the investigation with /var/log/messages and analyze the OS Watcher logs. Below is a situation where a database node was rebooted due to high load.

/var/log/messages output from a Database Node just before Node eviction:

Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA: 2*4kB 1*8kB 0*16kB 1*32kB 2*64kB 0*128kB
1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15792kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA32: 150*4kB 277*8kB 229*16kB 186*32kB 51*64kB
65*128kB 82*256kB 13*512kB 3*1024kB 3*2048kB 78*4096kB = 380368kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 Normal: 12362*4kB 58*8kB 0*16kB 0*32kB 0*64kB
0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 52984kB
Apr 23 08:15:09 remotedb06 kernel: 83907 total pagecache pages
Apr 23 08:15:11 remotedb06 kernel: 39826 pages in swap cache
Apr 23 08:15:11 remotedb06 kernel: Swap cache stats: add 30820220, delete 30780387, find
18044378/19694662
Apr 23 08:15:14 remotedb06 kernel: Free swap = 4kB
Apr 23 08:15:15 remotedb06 kernel: Total swap = 25165816kB
Apr 23 08:15:16 remotedb06 kernel: 25165808 pages RAM
Apr 23 08:15:28 remotedb06 kernel: 400673 pages reserved
Apr 23 08:15:30 remotedb06 kernel: 77691135 pages shared
Apr 23 08:15:31 remotedb06 kernel: 9226743 pages non-shared
Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4,
mode:0xd0

From the above messages, we can see that this system had only 4 kB of free swap out of 24 GB of swap space. This means the system had neither free RAM nor swap left for processing, which caused the reboot. The same picture is clear from the system's OS Watcher output:

08:15:29 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s

08:15:29 all 0.67 0.00 82.60 16.60 0.00 0.06 0.00 0.07 42521.98
08:15:35 all 0.84 0.00 23.40 73.19 0.00 0.11 0.00 2.47 35859.59
08:15:39 all 1.22 0.00 85.13 13.47 0.00 0.13 0.00 0.04 40569.31
08:15:45 all 1.57 0.00 98.22 0.13 0.00 0.08 0.00 0.00 36584.31
08:15:50 all 1.41 0.00 98.48 0.04 0.00 0.07 0.00 0.00 36643.10
08:15:54 all 0.84 0.00 99.09 0.03 0.00 0.05 0.00 0.00 36257.02
08:16:06 all 0.95 0.00 98.88 0.09 0.00 0.08 0.00 0.00 39113.15
08:16:11 all 0.87 0.00 99.00 0.07 0.00 0.06 0.00 0.00 37490.22
08:16:16 all 0.89 0.00 98.97 0.07 0.00 0.07 0.00 0.00 37681.04
08:16:22 all 0.78 0.00 99.12 0.05 0.00 0.05 0.00 0.00 36963.75
08:16:38 all 0.79 0.00 98.86 0.28 0.00 0.08 0.00 0.00 36639.21
08:16:43 all 0.78 0.00 98.79 0.34 0.00 0.08 0.00 0.01 37405.99
08:16:54 all 1.06 0.00 98.71 0.12 0.00 0.11 0.00 0.00 38102.37
08:17:08 all 1.69 0.00 67.02 30.93 0.00 0.06 0.00 0.29 37316.55

How to avoid Node Reboot due to High Load?


The simplest and best way to avoid this is to use Oracle Database Resource Manager (DBRM). DBRM helps by giving the database more control over hardware resources and their allocation. The DBA should set up resource consumer groups and a resource plan and use them as requirements dictate. On an Exadata system, the DBA can also use IORM to set up resource allocation among multiple database instances. A minimal DBRM setup is sketched below.
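
As an illustration, here is a minimal DBRM sketch run through SQL*Plus as SYS; the plan name, group name, and percentages are placeholders, not recommendations:

sqlplus -S / as sysdba <<'SQL'
BEGIN
  -- All resource manager changes are staged in a pending area.
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'BATCH_GROUP',                  -- placeholder name
    comment        => 'Low-priority batch sessions');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'DAYTIME_PLAN',                        -- placeholder name
    comment => 'Cap batch work so the node is not overloaded');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'BATCH_GROUP',
    comment          => 'Batch gets at most this share of CPU',
    mgmt_p1          => 20);                          -- placeholder value
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'OTHER_GROUPS',               -- required in every plan
    comment          => 'Everything else',
    mgmt_p1          => 80);
  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
-- Activate the plan across all RAC instances.
ALTER SYSTEM SET resource_manager_plan = 'DAYTIME_PLAN' SID='*';
SQL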

2. Voting Disk Not Reachable: Another reason for a node reboot is that Clusterware is not able to access a minimum number of voting files. When the node aborts for this reason, the node alert log will show a CRS-1606 error.

Here is a sample alert log entry for a voting disk that is not reachable:

2013-01-26 10:15:47.177
[cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum
number of voting files required, 2, resulting in CSSD
termination to ensure data integrity; details at (: CSSNM00018:) in
/u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log

There could be two reasons for this issue:

1. The connection to the voting disk was interrupted.
2. Only one voting disk is in use and the version is below 11.2.0.3.4, hitting known bug 13869978.

How to Solve a Voting Disk Outage?

There can be many reasons for a voting disk becoming unreachable. Here is a general approach for the DBA to follow (a command sketch for the first few steps appears below):

1. Use the command "crsctl query css votedisk" on a node where Clusterware is up to get a list of all the voting files.
2. Check that each node can access the devices underlying each voting file.
3. Check that the permissions on each voting file/disk have not been changed.
4. Check OS, SAN, and storage logs for any errors from the time of the incident.
5. Apply the fix for bug 13869978 if only one voting disk is in use. This is fixed in the 11.2.0.3.4 patch set and above, and in 11.2.0.4 and above.

If any voting files or underlying devices are not currently accessible from any node, work with the storage administrator and/or system administrator to resolve it at the storage and/or OS level.
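
A minimal sketch of steps 1 through 3 (the device path is a placeholder; use whatever "crsctl query css votedisk" actually reports):

# List the voting files and their states (run where Clusterware is up).
$GRID_HOME/bin/crsctl query css votedisk

# For each device reported above, confirm it is visible, readable, and
# that ownership/permissions are intact (path below is a placeholder).
ls -l /dev/mapper/votedisk01
dd if=/dev/mapper/votedisk01 of=/dev/null bs=1M count=1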

3. Missed Network Connection between Nodes: In technical terms, this is called a Missed Network Heartbeat (NHB). Whenever there is a communication gap, or no communication at all, between nodes on the private network (interconnect) due to a network outage or some other reason, a node aborts itself to avoid a "split brain" situation. The most common (but not exclusive) cause of a missed NHB is a network problem communicating over the private interconnect.

Suggestions to troubleshoot a Missed Network Heartbeat (a command sketch follows the list):

1. Check OS statistics from the evicted node from the time of the eviction. The DBA can use OS Watcher to look at the OS stats at the time of the issue; check oswnetstat and oswprvtnet for network-related issues.
2. Validate the interconnect network setup with the help of the network administrator.
3. Check communication over the private network.
4. Check that the OS network settings are correct by running the RAC check tool.
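
A minimal sketch for steps 2 and 3 (node names are placeholders):

# Show the interfaces registered with Clusterware, including the
# private interconnect.
$GRID_HOME/bin/oifcfg getif

# Check basic reachability of the remote node over the private
# network (node2-priv is a placeholder hostname).
ping -c 3 node2-priv

# Validate node connectivity with the Cluster Verification Utility.
$GRID_HOME/bin/cluvfy comp nodecon -n all -verbose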

4. Database or ASM Instance Hang: Sometimes a database or ASM instance hang can cause a node reboot. In these cases the instance hangs and is terminated afterwards, which causes either a cluster reboot or a node eviction. The DBA should check the alert logs of the database and ASM instances for any hang situation that might have caused the issue; a quick way to scan them is sketched below.
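
A minimal sketch for such a scan (the alert log path follows the standard ADR layout and is a placeholder for your own environment):

# Look for hang indicators and internal errors around the eviction time.
grep -nE 'appears to be hung|ORA-00600|terminating the instance' \
  /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/alert_remotedb2.log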

Database alert log entries for the hang situation:


Wed Jun 05 03:25:01 2013
Sweep [inc][122970]: completed
System State dumped to trace file
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122970/remotedb2_m005_118200_
i122970_a.trc
Wed Jun 05 03:25:33 2013
DSKM process appears to be hung. Initiating system state dump.
Wed Jun 05 03:25:33 2013
System state dump requested by (instance=2, osid=9757 (GEN0)), summary= [system state dump
request (ksz_check_ds)].
System State dumped to trace file
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc
Wed Jun 05 03:27:04 2013
Errors in file
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_pmon_9707.trc (incident=12289
7):
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x143642090], [3], [], [], [],
[], [], [], [], [], []
Incident details in:
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/incident/incdir_122897/remotedb2_pmon_9707_i1
22897.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_pmon_9707.trc:
ORA-00600: internal error code, arguments: [ksz_cln_proc1], [0x143642090], [3], [], [], [],
[], [], [], [], [], []
PMON (ospid: 9707): terminating the instance due to error 472
Wed Jun 05 03:27:05 2013
System state dump requested by (instance=2, osid=9707 (PMON)), summary= [abnormal instance
termination].
System State dumped to trace file
/u01/app/oracle/diag/rdbms/remotedb/remotedb2/trace/remotedb2_diag_9759.trc
Instance terminated by PMON, pid = 9707
Wed Jun 05 03:45:58 2013
Starting ORACLE instance (normal)

At the same time, resources at the cluster level started failing and the node was evicted. Real Application Cluster log files:

[/u01/app/11.2.0.3/grid/bin/oraagent.bin (96437)] CRS-5011: Check of resource "PCAB1DE"
failed: details at "(: CLSN00007:)" in
"/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:27:06.649
[/u01/app/11.2.0.3/grid/bin/oraagent.bin (96437)] CRS-5011: Check of resource "P1PANIE"
failed: details at "(: CLSN00007:)" in
"/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:32:41.360
[/u01/app/11.2.0.3/grid/bin/oraagent.bin (96437)] CRS-5011: Check of resource "P1ESXPE"
failed: details at "(: CLSN00007:)" in
"/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-06-05 03:36:25.675
[cssd (67762)] CRS-1662: Member kill requested by node exs2db03 for member number 1, group
DBPCAB1DE
2013-06-05 03:36:58.775
[/u01/app/11.2.0.3/grid/bin/oraagent.bin (96437)] CRS-5818: Aborted command 'clean' for
resource 'ora. pcab1de.db'. Details at (: CRSAGF00113:) {0:6:28} in
/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2013-06-05 03:37:06.784
[/u01/app/11.2.0.3/grid/bin/oraagent.bin (96437)] CRS-5818: Aborted command 'clean' for
resource 'ora.p1panie.db'. Details at (: CRSAGF00113:) {0:6:29} in
/u01/app/11.2.0.3/grid/log/exs2db04/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2013-06-05 03:37:26.307
[cssd(67762)]CRS-1608:This node was evicted by node 3, exs2db03; details at (:CSSNM00005:)
in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.318
[cssd (67762)] CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:
CSSSC00012:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log
2013-06-05 03:37:26.318
[cssd (67762)] CRS-1652: Starting cleanup of CRSD resources.
2013-06-05 03:37:26.340
[cssd (67762)] CRS-1608: This node was evicted by node 3, exs2db03; details at (:
CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.450
[cssd (67762)] CRS-1608: This node was evicted by node 3, exs2db03; details at (:
CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.647
[cssd (67762)] CRS-1608: This node was evicted by node 3, exs2db03; details at (:
CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.869
[cssd (67762)] CRS-1608: This node was evicted by node 3, exs2db03; details at (:
CSSNM00005:) in /u01/app/11.2.0.3/grid/log/exs2db04/cssd/ocssd.log.
2013-06-05 03:37:26.921

The main reason for this issue was "DSKM process appears to be hung.", but this was definitely not because of high load, since OS Watcher at that time does not show any load.

So, I believe this could be due to some bug in the database.

In a few cases, bugs can be the reason for a node reboot; the bug may be at the database level, the ASM level, or the Real Application Cluster level. In such cases, after the initial investigation on the database administrator's side, the DBA should open an SR with Oracle Support.
