
DRBD troubleshooting: the Primary/Unknown problem

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:272314733 nr:15766224 dw:88314479 dr:218442557 al:24340 bm:12960 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:23841424 misses:24646 starving:0 dirty:306 changed:24340

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:15766224 nr:272314733 dw:288180421 dr:6902987 al:4734 bm:13160 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:3961688 misses:4941 starving:0 dirty:207 changed:4734

I assigned db2 to be the primary DB server a week ago for monitoring, which means DB2 was Primary at failure time. I have stopped heartbeat and DRBD on db1 and disabled them at boot:

[root@db1 ~]# chkconfig --list | grep heartbeat
heartbeat  0:off  1:off  2:on  3:on  4:on  5:on  6:off
[root@db1 ~]# chkconfig --list | grep drbd
drbd       0:off  1:off  2:off  3:on  4:on  5:on  6:off

chkconfig --level 35 drbd off
chkconfig --level 35 heartbeat off
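As a side note, the same states can be read without eyeballing /proc/drbd. This is a minimal sketch, assuming the drbdadm query verbs shipped with DRBD 8.0 (in later releases "state" was renamed "role"):

# Connection state: StandAlone, WFConnection, Connected, ...
drbdadm cstate all
# Node roles, e.g. Primary/Unknown in the broken state above
drbdadm state all
# Disk states, e.g. UpToDate/DUnknown
drbdadm dstate all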

Rescue operation started at 18:55 PST, 14 Mar 2008. The following is the current status; the webmail servers are stopped.

DB1 and DB2 are turned ON. DB2 was primary before the failure, so the heartbeat and DRBD services are stopped on DB1. /etc/ha.d/ha.cf is updated on both servers to add eth0 as a bcast interface as well, and the deadtime is raised from 10 seconds to 30 seconds. The /etc/ha.d/ha.cf now looks like this:

logfacility local0
keepalive 2
deadtime 30 # USE THIS!!!
#deadtime 10
bcast eth0
bcast eth1
#serial /dev/ttyS0
#baud 19200
auto_failback off
respawn hacluster /usr/lib/heartbeat/ipfail
node db1.example.com db2.example.com

After these changes, heartbeat and DRBD are kept OFF on DB1, and heartbeat (not DRBD) is restarted on DB2. The status of DRBD on DB2 is:

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:15766224 nr:272314733 dw:288877845 dr:23063634 al:4799 bm:13495 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:4135979 misses:5006 starving:0 dirty:207 changed:4799

I have to start DRBD on DB1 manually:

[root@db1 ~]# modprobe drbd
[root@db1 ~]# lsmod | grep -i drbd
drbd  181896  0
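To verify that heartbeat traffic really flows over both broadcast links after this change, a quick check with tcpdump works. This is a hedged sketch: it assumes heartbeat's default UDP port 694 and the interface names from the ha.cf above.

# Run on either node; heartbeat's bcast transport uses UDP port 694
tcpdump -ni eth0 udp port 694
tcpdump -ni eth1 udp port 694
# Each should show periodic packets from the peer roughly every 2s (the keepalive)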

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
[root@db1 ~]#

DB2 is still in standalone mode:

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:15766224 nr:272314733 dw:288891985 dr:23063646 al:4799 bm:13495 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:4139514 misses:5006 starving:0 dirty:207 changed:4799

Back on DB1:

[root@db1 ~]# drbdadm up all
[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r--
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
[root@db1 ~]#

Notice the status of DRBD on DB1 is "WFConnection" => Waiting For Connection, while the status on DB2 is unchanged:

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:15766224 nr:272314733 dw:288891985 dr:23063646 al:4799 bm:13495 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:4139514 misses:5006 starving:0 dirty:207 changed:4799

You cannot bring the device up with drbdadm on this node; if you try anyway, you get:

[root@db2 ~]# drbdadm up all
Failure: (124) Device is attached to a disk (use detach first)
Command 'drbdsetup /dev/drbd0 disk /dev/sda6 /dev/sda5 flexible --set-defaults --create-device' terminated with exit code 10
[root@db2 ~]#

Since we don't need the disk actually mounted while we do this, and we don't want any other service interfering with our /data, let's shut down heartbeat (not DRBD) on DB2. This will automatically unmount /data as well.

[root@db2 ~]# service heartbeat stop

So after this step, all NFS, MySQL, etc. services are stopped and /data is unmounted:

[root@db2 ~]# mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)

The status of both servers is now as follows: heartbeat and DRBD stopped on DB1; heartbeat stopped on DB2; DRBD running on DB2; /data unmounted on both servers. At this point, both servers have acquired the status of Secondary/Unknown.
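Before touching DRBD further, it is worth confirming that nothing still holds the replicated filesystem open. A small sketch, assuming /data is the mount point used by this cluster as above:

# /data should no longer appear in the mount table
grep /data /proc/mounts || echo "/data is not mounted"
# And no process should still reference it (prints nothing if clean)
fuser -vm /data 2>/dev/null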

Let's try connecting both servers with "drbdadm -- connect all", first on DB2 (the current Primary server):

[root@db2 ~]# drbdadm -- connect all
[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown r--
    ns:0 nr:0 dw:288894573 dr:23063650 al:4799 bm:13585 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:4140161 misses:5006 starving:0 dirty:207 changed:4799

Doesn't help. It is still StandAlone.

[root@db1 ~]# drbdadm -- connect all
[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r--
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
[root@db1 ~]#

Doesn't help. It is still waiting for connection. There is no other way but to make this system primary by force. Perform this step ONLY on DB2, which is [supposed to be] the current primary. (This will start the sync process and may take hours, depending on the I/O speed of your setup.)

drbdadm -- --overwrite-data-of-peer primary all

[root@db2 ~]# drbdadm -- --overwrite-data-of-peer primary all

This statement has made DB2 Primary, alright. But the other side is still not connected.

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:0 nr:0 dw:288894573 dr:23063650 al:4799 bm:13585 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:4140161 misses:5006 starving:0 dirty:207 changed:4799
[root@db2 ~]#

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r--
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
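When a node sits in StandAlone after a connect attempt, the network state machine can also be cycled explicitly. This is a hedged sketch using standard drbdadm verbs; in our case reconnecting still fails, because of the split-brain shown in the logs below:

# Tear down and retry the replication link for all resources
drbdadm disconnect all
drbdadm connect all
# Then re-check the connection state
drbdadm cstate all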

Here is what I see in the log on DB1 after I issue the "connect all" or "overwrite data" commands:

Mar 14 07:36:15 db1 kernel: drbd0: conn( Unconnected -> WFConnection )
Mar 14 07:40:12 db1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Mar 14 07:40:12 db1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Mar 14 07:40:12 db1 kernel: drbd0: Split-Brain detected, dropping connection!
Mar 14 07:40:12 db1 kernel: drbd0: self 35B46A95F04C53C0:01028196C77A0C52:3D69CB5573788ED9:0000000000000004
Mar 14 07:40:12 db1 kernel: drbd0: peer 1DB708067D86D985:01028196C77A0C53:3D69CB5573788ED8:0000000000000004
Mar 14 07:40:12 db1 kernel: drbd0: conn( WFReportParams -> Disconnecting )
Mar 14 07:40:12 db1 kernel: drbd0: error receiving ReportState, l: 4!
Mar 14 07:40:12 db1 kernel: drbd0: asender terminated
Mar 14 07:40:12 db1 kernel: drbd0: tl_clear()
Mar 14 07:40:12 db1 kernel: drbd0: Connection closed
Mar 14 07:40:12 db1 kernel: drbd0: conn( Disconnecting -> StandAlone )
Mar 14 07:40:12 db1 kernel: drbd0: receiver terminated

============================ Help Start ============================
Helpful link: http://lists.linbit.com/pipermail/drbd-user/2007-June/006950.html

Split-brains are mainly caused by a loss of communication between the two nodes, with the Secondary node becoming Primary while the other node remains in the Primary state (both nodes think their peer is dead, so they both become Primary). Check your syslog to see what happened before the split-brain occurred (eth0 link down, etc.).

How to correct the situation? You have to decide which node you want to "sacrifice" and tell it to discard its data. Run the following command on the node on which you want to discard the data:

root@bad-data# drbdadm -- --discard-my-data connect all

Then simply connect the other machine:

root@good-data# drbdadm connect all

The nodes will start to resynchronize, transferring data from the good-data node (which will appear as SyncSource in /proc/drbd) to the bad-data node (SyncTarget).
============================ Help End ============================

My logs show the same problem. The connection/handshake is successful, and both nodes identify each other as valid nodes with unique IDs, but because of the split brain they are not syncing. One node has to lose its data, and I want DB1 to lose its data.
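Putting the advice together for this cluster, the whole fix boils down to two commands. This is just a restatement of the steps performed below, under the decision that DB1 sacrifices its data and DB2 is the good-data node:

# On DB1 (bad data): throw away local changes and reconnect
drbdadm -- --discard-my-data connect all

# On DB2 (good data): reconnect; DB2 becomes SyncSource, DB1 SyncTarget
drbdadm connect all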

So before I move on, heartbeat on both servers is stopped again to minimize any impact:

[root@db1 ~]# service heartbeat stop
[root@db2 ~]# service heartbeat stop

I make DB2 primary before moving on:

[root@db2 ~]# drbdadm -- --overwrite-data-of-peer primary all

Let's make DB1 lose its data. This is the bad-data node.

NOTE / WARNING: only issue the following command on the bad-data node. I have DB1 as the bad-data node.

[root@db1 ~]# drbdadm -- --discard-my-data connect all

After issuing this command on DB1, I see the following status on both servers:

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:WFConnection st:Secondary/Unknown ds:UpToDate/DUnknown C r--
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

Still the old status.

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown r--
    ns:0 nr:0 dw:288900033 dr:23068879 al:4799 bm:13675 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12488370 misses:12208 starving:0 dirty:0 changed:12208
    act_log: used:0/127 hits:4141526 misses:5006 starving:0 dirty:207 changed:4799

Still the old status. Now issue the connect all on the good-data node, which is DB2:

[root@db2 ~]# drbdadm connect all

Immediately, I see the following on DB2:

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r--
    ns:51552 nr:0 dw:288900033 dr:23120431 al:4799 bm:13678 lo:0 pe:0 ua:0 ap:0
    [======>.............] sync'ed: 33.4% (107572/159124)K
    finish: 0:00:08 speed: 12,888 (12,888) K/sec
    resync: used:0/31 hits:12491638 misses:12212 starving:0 dirty:0 changed:12212
    act_log: used:0/127 hits:4141526 misses:5006 starving:0 dirty:207 changed:4799
[root@db2 ~]#

Alhumdulillah. Let's check DB1. It should have finished syncing by now.

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--
    ns:0 nr:159124 dw:159124 dr:0 al:0 bm:129 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:11281 misses:129 starving:0 dirty:0 changed:129
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
You have new mail in /var/spool/mail/root
[root@db1 ~]#

Alhumdulillah!!!!! So now here is the final status of both DB servers:
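To follow a longer resync in real time instead of re-running cat by hand, the status can simply be polled. A trivial sketch; watch ships with standard CentOS 5 installs:

# Refresh the DRBD status every second; the sync bar and ETA update live
watch -n1 cat /proc/drbd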

DB1, which was supposed to be Secondary:

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--
    ns:0 nr:159124 dw:159124 dr:0 al:0 bm:129 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:11281 misses:129 starving:0 dirty:0 changed:129
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

And DB2, which was supposed to be Primary:

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r--
    ns:159124 nr:0 dw:288900033 dr:23228003 al:4799 bm:13804 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12499650 misses:12338 starving:0 dirty:0 changed:12338
    act_log: used:0/127 hits:4141526 misses:5006 starving:0 dirty:207 changed:4799

And the log files from both servers show the following, after I issued the last "connect all" command following the "discard data" command.

DB1: /var/log/messages
----------------------
Mar 14 08:01:18 db1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Mar 14 08:01:18 db1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Mar 14 08:01:18 db1 kernel: drbd0: Split-Brain detected, manually solved. Sync from peer node
Mar 14 08:01:18 db1 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Mar 14 08:01:18 db1 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:18 db1 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
Mar 14 08:01:18 db1 kernel: drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Mar 14 08:01:18 db1 kernel: drbd0: Began resync as SyncTarget (will sync 159124 KB [39781 bits set]).
Mar 14 08:01:18 db1 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:34 db1 kernel: drbd0: Resync done (total 16 sec; paused 0 sec; 9944 K/sec)
Mar 14 08:01:34 db1 kernel: drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Mar 14 08:01:34 db1 kernel: drbd0: Writing meta data super block now.

DB2: /var/log/messages
----------------------
Mar 14 08:01:18 db2 kernel: drbd0: conn( StandAlone -> Unconnected )
Mar 14 08:01:18 db2 kernel: drbd0: receiver (re)started
Mar 14 08:01:18 db2 kernel: drbd0: conn( Unconnected -> WFConnection )
Mar 14 08:01:18 db2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Mar 14 08:01:18 db2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Mar 14 08:01:18 db2 kernel: drbd0: Split-Brain detected, manually solved. Sync from this node
Mar 14 08:01:18 db2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Mar 14 08:01:18 db2 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:18 db2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Mar 14 08:01:18 db2 kernel: drbd0: Began resync as SyncSource (will sync 159124 KB [39781 bits set]).
Mar 14 08:01:18 db2 kernel: drbd0: Writing meta data super block now.
Mar 14 08:01:34 db2 kernel: drbd0: Resync done (total 16 sec; paused 0 sec; 9944 K/sec)
Mar 14 08:01:34 db2 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Mar 14 08:01:34 db2 kernel: drbd0: Writing meta data super block now.

======================

Great. Alhumdulillah. Now is the time to start the heartbeat and DRBD services on both nodes and also configure them to start at system boot. Recall that I had disabled the heartbeat and drbd services on DB1 (Secondary), so that if it rebooted accidentally, it wouldn't halt on the famous "[yes]" question at the terminal, waiting forever. I could not have done anything about that, as I am here in Pakistan and the servers are located in the US. And the service provider's technical support is as shi**y as it can get. (Sorry to say.)

Remember to start the heartbeat service on DB2 (Primary) first. So here it goes. FIRST, on DB2:

[root@db2 ~]# service heartbeat start
Starting High-Availability services: [ OK ]
[root@db2 ~]# service drbd start
Starting DRBD resources: [ ].
[root@db2 ~]# chkconfig --level 35 heartbeat on
[root@db2 ~]# chkconfig --level 35 drbd on

THEN on DB1:

[root@db1 ~]# service heartbeat start
Starting High-Availability services: [ OK ]
[root@db1 ~]# service drbd start
Starting DRBD resources: [ ].
[root@db1 ~]# chkconfig --level 35 heartbeat on
[root@db1 ~]# chkconfig --level 35 drbd on

==============================================================================

Just for peace of mind, here is the status of DRBD on both servers one more time:

[root@db2 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r--
    ns:161868 nr:0 dw:288902777 dr:23233232 al:4799 bm:13804 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:12499650 misses:12338 starving:0 dirty:0 changed:12338
    act_log: used:0/127 hits:4142212 misses:5006 starving:0 dirty:207 changed:4799

[root@db1 ~]# cat /proc/drbd
version: 8.0.6 (api:86/proto:86)
SVN Revision: 3048 build by buildsvn@c5-i386-build, 2007-12-08 01:17:51
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r--
    ns:0 nr:161868 dw:161868 dr:0 al:0 bm:129 lo:0 pe:0 ua:0 ap:0
    resync: used:0/31 hits:11281 misses:129 starving:0 dirty:0 changed:129
    act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0

==============================================================================

Alhumdulillah. Time to start both webmail servers too. (It has been too long for them to sit idle, hasn't it?!)
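One last sanity check that both services really are registered for boot on the runlevels used here. A small sketch; run it on both nodes:

# Both should report "on" for runlevels 3 and 5
chkconfig --list drbd
chkconfig --list heartbeat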
