What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)
Published: 2019-06-28



APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 and later

Information in this document applies to any platform.

PURPOSE

This note explains what split brain is in an Oracle Real Application Cluster and which errors and consequences are associated with it.

SCOPE

For DBAs and support engineers.

DETAILS

In general terms, split brain refers to data inconsistencies that arise when two separate data sets with overlapping scope are maintained, either because of how servers are arranged in a network design or because of a failure condition in which servers stop communicating and synchronizing their data with each other.

Two components of an Oracle Real Application Cluster implementation can experience split brain.

1. Clusterware layer

Cluster nodes maintain their heartbeats over the private network and the voting disk. When a private network disruption prevents cluster nodes from communicating with each other over the private network for the period defined by the misscount setting, split brain occurs. In that case, the voting disk is used to determine which node(s) survive and which node(s) are evicted. The common voting results are:

a. The group with more cluster nodes survives.

b. If each group contains the same number of nodes, the group containing the lower node number survives.
c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
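
Rules a and b above can be illustrated with a short Python sketch. This is only a model of the stated rules, not Oracle's actual CSS algorithm: the function name `choose_surviving_group` and the representation of each subcluster as a set of node numbers are assumptions made for illustration, and the load-based improvement in rule c is deliberately left out.

```python
from typing import List, Set


def choose_surviving_group(groups: List[Set[int]]) -> Set[int]:
    """Pick the surviving subcluster using rules a and b only (hypothetical helper)."""
    if not groups or any(not group for group in groups):
        raise ValueError("each group must contain at least one node number")
    # Rule a: the group with more nodes wins.
    # Rule b: on a tie, the group holding the lowest node number wins,
    # so the minimum is negated to make a smaller number rank higher.
    return max(groups, key=lambda group: (len(group), -min(group)))


if __name__ == "__main__":
    # Three-node cluster split into {1, 2} and {3}: the larger group survives.
    print(choose_surviving_group([{1, 2}, {3}]))
    # Two-node cluster split into {2} and {1}: the tie goes to node 1.
    print(choose_surviving_group([{2}, {1}]))
```

Encoding both rules in a single sort key keeps the tie-break explicit and makes it easy to see which rule decided a given outcome.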

Commonly, messages similar to the following appear in ocssd.log when split brain happens:

[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: ###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################

The messages above indicate that communication from node 2 to node 1 is not working, so node 2 sees only one node, while node 1 is working fine and can see two nodes in the cluster. To avoid split brain, node 2 aborted itself.

Solution: Engage the network administrator to check the private network layer and eliminate any network fault.
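
When many ocssd.log files need to be reviewed, the "my node(...) ... VS Node(...)" line can be pulled apart programmatically. The following Python sketch assumes the line has exactly the layout shown in the sample above; the function name and the dictionary keys (`local_size`, `remote_size`, and so on) are hypothetical labels chosen for illustration, not Oracle terminology.

```python
import re
from typing import Dict, Optional

# Matches the comparison line from the sample log, for example:
#   my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
_SPLIT_BRAIN_RE = re.compile(
    r"my node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)\s+VS\s+"
    r"Node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)"
)

_FIELDS = (
    "local_node", "local_leader", "local_size",
    "remote_node", "remote_leader", "remote_size",
)


def parse_split_brain_line(line: str) -> Optional[Dict[str, int]]:
    """Return the six numbers from a split-brain comparison line, or None."""
    match = _SPLIT_BRAIN_RE.search(line)
    if match is None:
        return None
    return dict(zip(_FIELDS, (int(value) for value in match.groups())))


if __name__ == "__main__":
    sample = (
        "[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
        "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)"
    )
    info = parse_split_brain_line(sample)
    if info is not None and info["local_size"] < info["remote_size"]:
        print(f"node {info['local_node']} is in the smaller group")
```

Comparing the two sizes reproduces the reading given above: the local node sees a group of one, the other side sees a group of two, so the local node is the one that aborts.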

2. Real Application Cluster (database) layer

To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by background processes such as LMON, LMD, LMS and LCK. If any of these processes experiences an IPC Send timeout, a communication reconfiguration and instance eviction follow in order to avoid split brain. The controlfile is used, much like the voting disk in the clusterware layer, to determine which instance(s) survive and which instance(s) are evicted. The voting result is similar to the clusterware voting result. As a result, one or more instances are evicted.

Common messages in the instance alert logs are similar to:

alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave: 
...
alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file 
/u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file 
/u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc  
(incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: 
/u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

In the example above, instance 2 LMD0 (pid 29940) is the receiver in the IPC Send timeout. Various causes can lead to an IPC Send timeout, for example:

a. Network problem

b. Process hang
c. Software bug

Please see "Top 5 issues for Instance Eviction" for more information.

In case of instance eviction, the alert log and all background traces need to be checked to determine the root cause.
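
As a starting point for that review, the eviction-related lines can be extracted from an alert log so the sequence of events is easier to follow. The Python sketch below searches only for message fragments that appear in the sample alert logs above; the marker list and the function name `find_eviction_events` are assumptions for illustration, and real alert logs may contain other relevant messages.

```python
from typing import Iterable, List, Tuple

# Message fragments taken from the sample alert logs in this note.
EVICTION_MARKERS = (
    "IPC Send timeout",
    "Communications reconfiguration",
    "Waiting for clusterware split-brain resolution",
    "Evicting instance",
    "ORA-29740",
)


def find_eviction_events(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, matched marker, line text) for each relevant line."""
    events = []
    for line_number, raw in enumerate(lines, start=1):
        text = raw.strip()
        for marker in EVICTION_MARKERS:
            if marker in text:
                events.append((line_number, marker, text))
                break  # report each line once, under the first marker it matches
    return events


if __name__ == "__main__":
    sample = [
        "Mon Dec 07 19:43:05 2011",
        "IPC Send timeout detected.Sender: ospid 26318",
        "Communications reconfiguration: instance_number 2",
        "Evicting instance 2 from cluster",
    ]
    for line_number, marker, text in find_eviction_events(sample):
        print(f"{line_number}: [{marker}] {text}")
```

The function accepts any iterable of lines, so an open file object can be passed directly. The output only narrows down where to look; the background trace files named in the alert log still need to be read to determine the root cause.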

Known Issues

1. Bug 7653579 - IPC send timeout in RAC after only short period 

    Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC 
    Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows

2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load

    Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 
    Fixed in: 11.2.0.1

3. Bug 8365141 - DRM quiesce step hang causes instance eviction 

    Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1

4. Bug 7587008 - Hung RAC instance not evicted from cluster 

    Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1; one-off patches are available for various 11.1.0.7 releases

5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits 

    Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows

6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2

    BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
    BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
    Refer: 11gR2: LMON received an instance eviction notification from instance n 
    Fixed in: 11.2.0.4; merge patches are available for 11.2.0.2 and 11.2.0.3

Reposted from: http://cjrxl.baihongyu.com/
