Node Reboots in a Real Application Cluster (RAC) Environment
Remote DBAs regularly face node reboots or node evictions in a Real Application
Cluster environment. A node reboot is performed by CRS to maintain consistency in
the cluster by removing a node that is facing some critical issue.
A critical problem could be a node not responding via a network heartbeat, a node not
responding via a disk heartbeat, or a hung ocssd.bin process. There are many more
possible reasons for a node eviction, but some of them are common and recurring.
I am listing the common ones here:
Whenever a Database Administrator faces a node reboot issue, the first things to look at
are /var/log/messages and the OS Watcher logs of the database node that was rebooted.
/var/log/messages gives an actual picture of the reboot: the exact time of the restart,
the status of resources such as swap and RAM, and so on.
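As an illustration only (this is a hypothetical sketch, not an Oracle tool), the scan a DBA performs by eye on /var/log/messages can be expressed as a small filter that keeps the memory-pressure lines; the keyword list and `pressure_lines` helper are my own assumptions:

```python
# Hypothetical helper: pick out /var/log/messages-style lines that hint
# at the RAM/swap exhaustion which typically precedes a load-related reboot.
KEYWORDS = ("page allocation failure", "free swap", "out of memory", "oom-killer")

def pressure_lines(lines):
    """Return only the log lines containing a memory-pressure marker."""
    return [ln for ln in lines if any(k in ln.lower() for k in KEYWORDS)]

# Sample lines modelled on the excerpt shown later in this article.
sample = [
    "Apr 23 08:15:14 remotedb06 kernel: Free swap  = 4kB",
    "Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4, mode:0xd0",
    "Apr 23 08:15:40 remotedb06 sshd[1234]: session opened for user oracle",
]
for ln in pressure_lines(sample):
    print(ln)
```

In practice you would feed the real file (e.g. `open("/var/log/messages")`) into `pressure_lines` instead of the sample list.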
1. High Load on Database Server: In my experience, in 70 to 80 out of 100 issues,
high load on the system was the reason for the node eviction. One common scenario:
under high load, the RAM and swap space of the DB node get exhausted, the system stops
responding, and it finally reboots.
So, every time you see a node eviction, start the investigation with /var/log/messages
and analyze the OS Watcher logs. Below is a situation where a database node was rebooted
due to high load.
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA: 2*4kB 1*8kB 0*16kB 1*32kB 2*64kB 0*128kB
1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15792kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 DMA32: 150*4kB 277*8kB 229*16kB 186*32kB 51*64kB
65*128kB 82*256kB 13*512kB 3*1024kB 3*2048kB 78*4096kB = 380368kB
Apr 23 08:15:04 remotedb06 kernel: Node 0 Normal: 12362*4kB 58*8kB 0*16kB 0*32kB 0*64kB
0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 52984kB
Apr 23 08:15:09 remotedb06 kernel: 83907 total pagecache pages
Apr 23 08:15:11 remotedb06 kernel: 39826 pages in swap cache
Apr 23 08:15:11 remotedb06 kernel: Swap cache stats: add 30820220, delete 30780387, find
18044378/19694662
Apr 23 08:15:14 remotedb06 kernel: Free swap = 4kB
Apr 23 08:15:15 remotedb06 kernel: Total swap = 25165816kB
Apr 23 08:15:16 remotedb06 kernel: 25165808 pages RAM
Apr 23 08:15:28 remotedb06 kernel: 400673 pages reserved
Apr 23 08:15:30 remotedb06 kernel: 77691135 pages shared
Apr 23 08:15:31 remotedb06 kernel: 9226743 pages non-shared
Apr 23 08:15:33 remotedb06 kernel: osysmond.bin: page allocation failure. order:4,
mode:0xd0
From the messages above, we can see that this system had only 4 kB of free swap out of
about 24 GB of swap space. In other words, the system had neither RAM nor swap left for
processing, which caused the reboot. The same picture is visible in the OS Watcher
output of the system.
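To put those kernel counters in perspective, here is a quick back-of-the-envelope check (illustrative Python; the two values are copied from the log excerpt above):

```python
# Swap counters taken from the /var/log/messages excerpt above, in kB.
free_swap_kb = 4
total_swap_kb = 25165816  # roughly 24 GB

# Fraction of swap in use at the moment of the page allocation failure.
used_pct = 100.0 * (total_swap_kb - free_swap_kb) / total_swap_kb
print(f"swap used: {used_pct:.4f}%")  # effectively 100% - swap is exhausted
```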
08:15:29 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
08:15:29 all 0.67 0.00 82.60 16.60 0.00 0.06 0.00 0.07 42521.98
08:15:35 all 0.84 0.00 23.40 73.19 0.00 0.11 0.00 2.47 35859.59
08:15:39 all 1.22 0.00 85.13 13.47 0.00 0.13 0.00 0.04 40569.31
08:15:45 all 1.57 0.00 98.22 0.13 0.00 0.08 0.00 0.00 36584.31
08:15:50 all 1.41 0.00 98.48 0.04 0.00 0.07 0.00 0.00 36643.10
08:15:54 all 0.84 0.00 99.09 0.03 0.00 0.05 0.00 0.00 36257.02
08:16:06 all 0.95 0.00 98.88 0.09 0.00 0.08 0.00 0.00 39113.15
08:16:11 all 0.87 0.00 99.00 0.07 0.00 0.06 0.00 0.00 37490.22
08:16:16 all 0.89 0.00 98.97 0.07 0.00 0.07 0.00 0.00 37681.04
08:16:22 all 0.78 0.00 99.12 0.05 0.00 0.05 0.00 0.00 36963.75
08:16:38 all 0.79 0.00 98.86 0.28 0.00 0.08 0.00 0.00 36639.21
08:16:43 all 0.78 0.00 98.79 0.34 0.00 0.08 0.00 0.01 37405.99
08:16:54 all 1.06 0.00 98.71 0.12 0.00 0.11 0.00 0.00 38102.37
08:17:08 all 1.69 0.00 67.02 30.93 0.00 0.06 0.00 0.29 37316.55
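The striking feature of the OS Watcher table above is the sustained `%sys` near 99%: the CPUs are spending almost all their time in the kernel (here, reclaiming memory), not in user processes. A sketch of how one might flag such samples programmatically (the 90% threshold and the row tuples are my own illustrative choices):

```python
# Sketch: flag OS Watcher (mpstat-style) samples where kernel time dominates.
# Each tuple is (timestamp, %user, %sys), copied from the table above.
rows = [
    ("08:15:45", 1.57, 98.22),
    ("08:15:50", 1.41, 98.48),
    ("08:17:08", 1.69, 67.02),
]

SYS_THRESHOLD = 90.0  # arbitrary cut-off for "kernel-bound" samples

hot = [ts for (ts, user_pct, sys_pct) in rows if sys_pct > SYS_THRESHOLD]
print(hot)
```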
2. Voting Disk Not Reachable: Another reason for a node reboot is that Clusterware
is not able to access a minimum number of voting files. When a node aborts for this
reason, the node alert log will show error CRS-1606.
2013-01-26 10:15:47.177
[cssd(3743)]CRS-1606:The number of voting files available, 1, is less than the minimum
number of voting files required, 2, resulting in CSSD
termination to ensure data integrity; details at (: CSSNM00018:) in
/u01/app/11.2.0/grid/log/apdbc76n1/cssd/ocssd.log
If any voting files or underlying devices are not currently accessible from any node,
work with the storage administrator and/or system administrator to resolve it at the
storage and/or OS level. In the meantime, resources at the cluster level start failing
and the node evicts itself.
3. Bugs: In a few cases, a bug could be the reason for the node reboot; the bug may be
at the database level, the ASM level, or the Real Application Cluster level. In that
case, after the initial investigation from the Database Administrator's side, the DBA
should open an SR with Oracle Support.