Oracle Grid Infrastructure 11.2.0.2 has many features, including cluster node membership, cluster resource management, and cluster resource monitoring. One key area where DBAs need expert knowledge is how cluster node membership works and how the cluster decides to evict a node when there is a problem with the heartbeat network, a voting disk, or the node itself. Oracle 11.2.0.2 brought many new features, and one of them is rebootless fencing.

But what happened before that?

When a sub-component of Oracle RAC such as the private interconnect or a voting disk fails, Oracle Clusterware tries to prevent a split-brain by rebooting the node quickly, without waiting for outstanding I/O operations or for the file systems to synchronize.

To read more about split-brain in RAC, see: Split Brain and Amnesia in Oracle RAC

Oracle uses algorithms common to STONITH (Shoot The Other Node In The Head) implementations to determine which nodes need to be fenced. When a node is alerted that it is being “fenced”, it carries out the order itself, by suicide.

STONITH automatically powers down a node that is not working correctly. An administrator might employ STONITH if one of the nodes in a cluster cannot be reached by the other node(s) in the cluster.

But starting with 11.2.0.2, the mechanism changed.

Oracle improved node fencing in Oracle 11g Release 2 (11.2.0.2): rather than simply rebooting the failed node, Clusterware kills the processes on that node that are capable of performing I/O and then stops the Clusterware stack on it. Whenever a sub-component of Oracle RAC such as the private interconnect or a voting disk fails, Oracle Clusterware first decides which node to evict, then:

  1. Clusterware attempts to shut down all Oracle resources and processes on that node, especially the processes that generate I/O.
  2. Clusterware stops the cluster services on that node.
  3. OHASD (Oracle High Availability Services Daemon) then tries to start the CRS (Cluster Ready Services) stack again. Once the interconnect is back online, all cluster resources on that node are started automatically.
  4. If it is not possible to stop the resources or the processes generating I/O, Clusterware kills (reboots) the node.
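
The four steps above can be sketched as a small decision flow. This is only an illustrative model in Python, not Oracle's implementation: the `Process`, `Node`, and `fence_rebootless` names and the `stoppable` flag are hypothetical, and the sketch assumes that failing to stop any I/O-capable process is what triggers the fallback to a node reboot.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    STACK_RESTARTED = "clusterware stack restarted, no reboot"
    NODE_REBOOTED = "node rebooted"


@dataclass
class Process:
    name: str
    does_io: bool
    stoppable: bool = True  # hypothetical flag: can it be stopped cleanly?


@dataclass
class Node:
    name: str
    processes: List[Process] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


def fence_rebootless(node: Node) -> Outcome:
    """Model the 11.2.0.2 fencing steps for a node chosen for eviction."""
    # Step 1: try to stop every process, I/O-capable ones first.
    for proc in sorted(node.processes, key=lambda p: not p.does_io):
        if proc.stoppable:
            node.log.append(f"stopped {proc.name}")
        elif proc.does_io:
            # Step 4: an I/O-capable process survived, so fall back to a reboot.
            node.log.append(f"could not stop {proc.name}")
            node.log.append("rebooting node")
            return Outcome.NODE_REBOOTED
        else:
            # Assumption: a stuck process that does no I/O does not force a reboot.
            node.log.append(f"left {proc.name} running (no I/O)")
    # Step 2: stop the cluster services on the node.
    node.log.append("stopped cluster services")
    # Step 3: OHASD tries to bring the CRS stack back up.
    node.log.append("OHASD restarting CRS stack")
    return Outcome.STACK_RESTARTED


if __name__ == "__main__":
    clean = Node("rac1", [Process("listener", False), Process("db_writer", True)])
    print(clean.name, "->", fence_rebootless(clean).value)

    stuck = Node("rac2", [Process("db_writer", True, stoppable=False)])
    print(stuck.name, "->", fence_rebootless(stuck).value)
```

The point of the model is the ordering: the reboot is a last resort, taken only when I/O cannot be guaranteed to have stopped, because that is the condition under which a split-brain could still corrupt shared data.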

