
Oracle RAC Private Network Failure - Test Cases

Test 1 - PRV-Network-1
Action/Target: Remove a SINGLE primary private network cable ==> CSS master node
Preconditions:
- Initiate all workloads (especially those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, plus OS stress)
- Identify the vendor, CSS and CRS master nodes
Note: Because CSS supports only one physical interface in pre-11.2.0.2 versions, network interface teaming/bonding on the private interconnect is needed to accomplish this. The interconnect should be bonded.
Steps:
1- Physically remove the primary private network cable from the CSS master/the vendor clusterware master
2- Wait 600 seconds
3- Restore the primary network cable
4- Remove the secondary network cable from the CSS master
5- Wait 600 seconds
6- Restore the secondary network cable
Test 2 - PRV-Network-2
Action/Target: Remove primary + secondary private network cables ==> CSS master node
Test Category: Sanity Check
Preconditions:
- Initiate all workloads (especially those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, plus OS stress)
- Identify both the CSS and CRS master nodes
Note: Because CSS supports only one physical interface in pre-11.2.0.2 versions, network interface teaming/bonding on the private interconnect is needed to accomplish this.
Steps:
1- Physically remove both the primary + secondary private network cables from the CSS master
2- Re-attach both network cables after the CSS master is evicted (by either Oracle Clusterware or the vendor clusterware, if present) and rebooted (pre-11.2.0.2 behavior).
In 11.2.0.2, if the node does not reboot after CSSD terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually run `crsctl start crs` to start the CRS stack (see the sketch after the variants below).
3- Wait until the former CSS master node rejoins the cluster
Variants:
Var 1 - Remove both the primary + secondary private network cables from the CRS master (in lieu of the CSS master)
Var 2 - Remove both cables, then replace them before either the vendor clusterware or the CSS heartbeats expire. The preferred result is that no actions are taken by any component, including RAC and ASM.
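
For reference, the 11.2.0.2 manual recovery called out in Step 2 can be scripted roughly as below. This is a minimal sketch run as root on the evicted node; the GRID_HOME path is an assumption and must be adjusted to the local installation.

```bash
#!/bin/bash
# Sketch of the 11.2.0.2 manual recovery after CSSD terminates without a reboot.
# Assumption: GRID_HOME points at the local Grid Infrastructure home.
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}

"$GRID_HOME/bin/crsctl" stop crs -f    # force-stop the remaining clusterware processes
echo "Re-attach both private network cables now, then press Enter." && read -r
"$GRID_HOME/bin/crsctl" start crs      # restart the Clusterware stack on this node
"$GRID_HOME/bin/crsctl" check crs      # confirm CRS, CSS and EVM report as online
```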
Test 3 - PRV-Network-3
Action/Target: Remove primary + secondary private network cables ==> T staggered RAC hosts
Test Category: Sanity Check
Preconditions:
- Initiate client workloads
- Identify the vendor and CSS master nodes
- Identify a set of T = N-1 RAC hosts (N = number of clustered database hosts), including the CSS master
Steps:
1- Physically remove both the primary + secondary private network cables from the current CSS master
2- Re-attach the private network cables after the CSS master is evicted and rebooted (pre-11.2.0.2 behavior).
In 11.2.0.2, if the node does not reboot after CSSD terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually run `crsctl start crs` to start the CRS stack.
3- Repeat Step 1 against the surviving nodes (do not wait for the rebooted node to come back and rejoin) until there is only one surviving node left.
Variants:
Var 1 - Split the cluster such that the lowest-numbered vendor and CSS nodes are left in the smaller node group. For example, in a 4-node cluster, split the cluster 1-3 with the singleton as the lowest node; similarly, for a 5-node cluster, split it 2-3 with the 2 lowest nodes in the smaller group.
Var 2 - Repeat these tests using `ifdown` rather than a cable disconnect.
Test 4 - PRV-Network-4
Action/Target: Power off the private network switches; for redundant switches, power both down.
Preconditions:
- Initiate client workloads
- Identify the CSS master
Steps:
1- Power off both the primary and secondary private network switches
2- Wait for at least CSS MISSCOUNT seconds before powering the private network switches back on (see the misscount query sketch after the variants)
3- Wait until all nodes reboot and subsequently rejoin the cluster.
In 11.2.0.2, if a node does not reboot after CSSD terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, power the private network switch back on, then manually run `crsctl start crs` to start the CRS stack.
Variants:
None
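
The MISSCOUNT value referenced in Step 2 can be read directly from CSS before the test. A minimal sketch, assuming it is run as root (or the grid owner) on any cluster node:

```bash
# Query the configured CSS misscount (in seconds) to size the switch outage window;
# on Linux the default is 30 seconds.
crsctl get css misscount
# Keep both switches powered off for at least that many seconds before restoring power.
```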
Test 5 - PRV-Network-5
Action/Target: Split-brain resolution
Test Category: Sanity Check (brownout time data required)
Preconditions:
- Initiate client workloads
- Identify the CSS master
- This test requires 2 network switches
Note: Some vendor clusterware products may require the configuration of a quorum disk to be able to run this test.
Steps:
1- Pull the network cables simultaneously so that Node 1 can only communicate with Node 2 and Node 3 can only communicate with Node 4. This test assumes either N1 or N2 is the CSS master.
2- Wait for at least 2 * CSS MISSCOUNT seconds so that the split-brain resolution algorithm kicks in. N3 and N4 should reboot in pre-11.2.0.2 versions.
In 11.2.0.2, if a node does not reboot after CSSD terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually run `crsctl start crs` to start the CRS stack.
3- Restore the network so N3 and N4 can rejoin the cluster.
Expected Test Outcomes

Test 1 (PRV-Network-1):
The bonding software should fail over with no impact on CSS, ASM or RAC.
Vendor Clusterware:
- Zero impact on all clusterware daemons
ASM and RAC:
- Zero impact on the stability of all RAC hosts
- Zero node evictions or cluster failures
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing (see the sketch below).
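
A minimal sketch of the 60-second collection loop requested above; the Grid home path and output directory are assumptions, and the loop is stopped (Ctrl-C or kill) at the end of the run.

```bash
#!/bin/bash
# Collect `crsctl stat res -t` every 60 seconds for the duration of the test run.
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}   # assumed Grid Infrastructure home
OUTDIR=${OUTDIR:-/tmp/crsstat_audit}           # assumed collection directory
mkdir -p "$OUTDIR"
while true; do
  ts=$(date +%Y%m%d_%H%M%S)
  "$GRID_HOME/bin/crsctl" stat res -t > "$OUTDIR/crsstat.$ts" 2>&1
  sleep 60
done
```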
Test 2 (PRV-Network-2):
Vendor Clusterware:
- When the vendor clusterware heartbeat is on the same private network (recommended), it detects the private network failure and determines the cluster membership changes. Oracle Clusterware receives the notification and reports the membership change to CRS and the RDBMS. This is the best result for our shared customers.
Oracle Clusterware:
- When there is no vendor clusterware heartbeat, the customer must wait for MISSCOUNT to expire (see misscount tuning).
RAC:
- Zero impact on the stability of surviving RAC hosts.
- Uninterrupted cluster-wide I/O operations.
- No report of complete cluster failures/reboots.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node will not reboot after CSSD terminates. Otherwise, the node will still reboot.
- Oracle Clusterware resources managed by the evicted node either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
- For a policy-managed database, the evicted server will be removed from its Oracle server pool. If there is a server in the Free pool, that server will be added to the Oracle server pool and the database instance can be started on it automatically.
Test 3 (PRV-Network-3):
Vendor Clusterware:
- Same as RAC
RAC:
- All N-1 node evictions result in successful cluster rejoins.
- Zero impact on the stability of the RAC hosts.
- Uninterrupted cluster-wide I/O operations.
- No report of complete cluster failures/reboots.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node will not reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.
- After the nodes come back, the SCAN VIPs and SCAN Listeners should disperse to different nodes, not remain on only one node.
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 4 (PRV-Network-4):
RAC:
- All node evictions result in successful cluster rejoins.
- Zero impact on the stability of the RAC hosts.
- Uninterrupted cluster-wide I/O operations at both node leave and node join, as measured by the client application.
- No report of complete cluster failures/reboots.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node will not reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.
- After the nodes come back, the SCAN VIPs and SCAN Listeners should disperse to different nodes, not remain on only one node.
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 5 (PRV-Network-5):
- N3 and N4 reboot.
- N3 and N4 rejoin the cluster.
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node will not reboot after CSSD terminates. Otherwise, the node will still reboot.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
Oracle RAC Public Network Failure - Test Cases

Test 1 - Pub-Network-1
Action/Target: Remove the primary public network cable ==> CRS master node
Preconditions:
- Initiate all workloads
- Identify both the CSS and CRS master nodes
Steps:
1- Physically remove the primary public network cable from the CRS master
2- Wait 120 seconds
3- Restore the primary public network cable
4- Remove the secondary network cable from the CRS master
5- Wait 120 seconds
6- Restore the secondary public network cable
Variants:
None
Test 2 - Pub-Network-2
Action/Target: Remove primary + secondary public network cables ==> CRS master node
Test Category: Sanity Check
Preconditions:
- Initiate all workloads
- Identify both the CSS and CRS master nodes
Steps:
1- Physically remove both the primary + secondary public network cables from the CRS master (run `crsctl stat res -t > crsstat.0` before removing the cables)
2- Wait until the Oracle VIP and dependent services fail over (i.e. those services whose CRS placement policies allow them to do so).
3- Note the time it takes for CRS to fail over the VIP (run `crsctl stat res -t > crsstat.1`)
4- Re-attach both public network cables.
Note:
In 11gR2, the VIP should fail back automatically without human intervention, but the SCAN VIP and SCAN Listener should not fail back automatically. (Run `crsctl stat res -t > crsstat.2` after reattaching the public network cables in 11gR2.)
Save crsstat.[012] to the /crs_log directory (see appendix C); a sketch of the snapshot sequence follows.
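
A minimal sketch of the snapshot sequence above, run on the CRS master; the crsstat.[012] file names and /crs_log directory come from the steps, everything else is local convention.

```bash
#!/bin/bash
# Snapshot CRS resource states before, during and after the public network fault.
crsctl stat res -t > crsstat.0          # before removing the public network cables
echo "Pull both public cables; press Enter once the VIP failover has completed." && read -r
crsctl stat res -t > crsstat.1          # after the VIP and dependent services fail over
echo "Re-attach both public cables; press Enter when done." && read -r
crsctl stat res -t > crsstat.2          # 11gR2: VIP fails back, SCAN VIP/Listener stay put
mkdir -p /crs_log && cp crsstat.0 crsstat.1 crsstat.2 /crs_log/
```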
Expected Test Outcomes

Test 1 (Pub-Network-1):
Vendor Clusterware:
- Same as RAC
RAC:
- Zero impact on the stability of all RAC hosts
- Zero node evictions or cluster failures
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 2 (Pub-Network-2):
NAS/SAN:
- No data corruption or I/O interruption reported from surviving nodes at both node leave and node join
RAC:
- Zero impact on the stability of the RAC hosts.
- Uninterrupted cluster-wide I/O operations
- No report of complete cluster failures/reboots.
- Oracle Clusterware resources managed by the affected node either go OFFLINE or fail over to another RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.
Oracle RAC HOST Failures Test Cases
Test 1 - Host-Test-1
Action/Target: Hard fail (e.g. power off, hard reset) a RAC host ==> vendor master node
Preconditions:
- Initiate client workloads
- Induce stress conditions: high CPU in real and user time; low swap space
- Identify the vendor, CSS and CRS master nodes
Steps:
1- Forcibly reset or power off the current vendor master
2- Wait until the original CSS master reboots and rejoins the cluster
Variants:
Var 1 - Split the cluster during the clusterware reconfiguration.
Var 2 - Fail the node that is the CSS master rather than the clusterware master, or fail both concurrently.
Var 3 - Have CRS operations in progress, such as a VIP failover, and hard reset the node.

Expected Test Outcome:
Vendor Clusterware:
-
RAC:
- Zero impact on the stability of all surviving RAC hosts
- No other RAC hosts should fail as a result of the master node failure
- The SCAN VIP and SCAN Listener should fail over to another node if they were on this node before the hard failure
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
Test 2 - Host-Test-2
Action/Target: Power off multiple RAC hosts ==> T staggered RAC hosts
Test Category: Sanity Check (reconfiguration time data required)
Preconditions:
- Initiate client workloads
- Identify the vendor and CSS master nodes
- Identify a set of T = N-1 RAC hosts (N = number of clustered database hosts), including the CSS master
Steps:
1- Reboot the current CSS master
2- Repeat Step 1 against the surviving nodes until there is only one surviving node left
3- If possible, determine the interim time (in seconds) during which database I/Os experience freezes, if any
Variants:
None

Expected Test Outcome:
Vendor Clusterware:
- The vendor clusterware detects the node member leaving and subsequently rejoining, and determines the cluster reconfiguration changes
RAC:
- All node departures result in successful cluster rejoins.
- Zero impact on the stability of surviving RAC hosts.
- Uninterrupted cluster-wide I/O operations.
- No report of complete cluster failures/reboots.
- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include the SCAN VIP and SCAN Listener, which should fail over to another node if they were on this node before the hard failure.
- After the nodes come back, the SCAN VIPs and SCAN Listeners should disperse to different nodes, not remain on only one node.
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Oracle High Availability Testing
Test 1 - HA-Test 1
Action/Target: Run multiple cluvfy operations during the Oracle Clusterware and RAC install ==> all RAC hosts
Test Category: Sanity Check
Preconditions:
- Type `cluvfy` to see all available command syntax and options
Steps:
1- Run the cluvfy pre-condition check
2- Do the next install step
3- Run the cluvfy post-condition check (`cluvfy comp software -n node_list`) to check the file permissions
There is no need to collect CRS/RDBMS logs for this test, but you need to submit the cluvfy output. A hedged example of a pre/post check pair follows.
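
A minimal sketch of one pre/post pair, assuming a two-node cluster (node1,node2) and the Clusterware-install stage; the stage name, node list and output file names are illustrative only.

```bash
#!/bin/bash
# Pre-check, install step, post-check with cluvfy; keep the output for submission.
cluvfy stage -pre crsinst -n node1,node2 -verbose | tee cluvfy_pre.out
# ... perform the next install step here ...
cluvfy stage -post crsinst -n node1,node2 -verbose | tee cluvfy_post.out
# File-permission check called out in Step 3:
cluvfy comp software -n node1,node2 -verbose | tee -a cluvfy_post.out
```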
Test 2 - HA-Test 2
Action/Target: Run concurrent `crsctl start/stop crs` commands to stop or start Oracle Clusterware in planned mode ==> all RAC hosts
Test Category: Sanity Check
Preconditions:
- Initiate all workloads
- Identify both the CSS and CRS master nodes
- Type `crsctl` as root to see all available command syntax and options
Steps:
1- As the root user, run the `crsctl stop crs` command concurrently on more than one RAC host to stop the resident Oracle Clusterware stack
2- Wait until the target Oracle Clusterware stack is fully stopped (verify via the `ps` command)
3- As the root user, run the `crsctl start crs` command concurrently on more than one RAC host to start the resident Oracle Clusterware stack
A hedged sketch of driving these commands concurrently from one node follows.
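
A minimal sketch of driving the stop/start concurrently from one driver node, assuming root ssh equivalence to the listed hosts and a standard Grid home path; host names and paths are illustrative.

```bash
#!/bin/bash
# Stop, verify and restart the Clusterware stack concurrently on several hosts.
HOSTS="racnode1 racnode2"                 # illustrative host names
CRSCTL=/u01/app/11.2.0/grid/bin/crsctl    # assumed Grid home

for h in $HOSTS; do ssh root@"$h" "$CRSCTL stop crs" & done; wait

# Confirm the stacks are fully down (no clusterware daemons left) before restarting.
for h in $HOSTS; do
  echo "== $h =="
  ssh root@"$h" "ps -ef | grep -E '[o]cssd|[c]rsd|[e]vmd' || echo 'stack is down'"
done

for h in $HOSTS; do ssh root@"$h" "$CRSCTL start crs" & done; wait
```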
Test 3 - HA-Test 3
Action/Target: Run other concurrent crsctl commands, such as `crsctl check crs` ==> all RAC hosts
Preconditions:
- Initiate all workloads
- Identify both the CSS and CRS master nodes
- Type `crsctl` as root to see all available command syntax and options
Steps:
1- As the root user, run `crsctl check crs` concurrently on all nodes
2- As the root user, run `crsctl check cluster -all` concurrently on all nodes
Test 4 - HA-Test 4
Action/Target: Remove and add voting disk files ==> random RAC hosts
Preconditions:
- Ensure the Oracle Clusterware configuration has 3 or more CSS voting disk files
- This test does not use the 11gR2 new feature; the voting files are not in an ASM diskgroup
- Type `crsctl` as root to see all available command syntax and options
Steps:
1- Make sure ocssd.bin is up on all nodes
2- As the root user, run `crsctl query css votedisk`
3- Run `crsctl delete css votedisk` repeatedly until one voting file is left; CRS should not allow you to delete the very last one
4- Run `crsctl add css votedisk` (e.g. by adding back the voting disk files that were previously deleted)
5- Finally, run `crsctl query css votedisk` again
A hedged sketch of this sequence follows.
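
A minimal sketch of the sequence, run as root, for a cluster whose voting files live on raw devices; the device paths are illustrative placeholders and, depending on the version, `crsctl delete css votedisk` may take either a path or the voting file's File Universal Id as reported by the query.

```bash
#!/bin/bash
# Remove voting files down to one (the last delete must be rejected), then add them back.
crsctl query css votedisk                     # list the configured voting files
crsctl delete css votedisk /dev/raw/raw2      # illustrative path; repeat per extra file
crsctl delete css votedisk /dev/raw/raw3
crsctl delete css votedisk /dev/raw/raw1      # expected to FAIL: CRS must refuse to drop the last one
crsctl add css votedisk /dev/raw/raw2         # add the previously deleted files back
crsctl add css votedisk /dev/raw/raw3
crsctl query css votedisk                     # confirm the final configuration
```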
Expected Test Outcomes

Test 1 (HA-Test 1):
Vendor Clusterware:
- Same as RAC
RAC:
- Correct cluster verification checks given the state of the cluster hardware and software
- Please provide the CVU-related logs under $CRS_HOME/cv/log

Test 2 (HA-Test 2):
Vendor Clusterware:
- N/A
RAC:
- Stop: All Oracle Clusterware daemons stop without leaving open ports or zombie processes
- Start: All Oracle Clusterware daemons start without error messages in stdout or in any of the CRS, CSS or EVM traces
- Start: All registered HA resource states match the target states, as per `crsctl stat res -t`
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 3 (HA-Test 3):
Vendor Clusterware:
- Same as RAC
RAC:
- Both the `crsctl check crs` and `crsctl check cluster -all` commands produce the appropriate, useful output without any error messages
- Collect the output for step 1 and step 2
Test 4 (HA-Test 4):
RAC:
- Voting disk files are added and removed without failures or error messages
- The crsctl query presents the correct state of all voting disk files
11gR2 New Features Failover Cases
Test 1 - 11gR2-Case-1
Action/Target: 11gR2 new features of using ASM voting files and ASM OCR files; this is the OCR/VF migration-to-ASM test.
Test Category: Sanity Check
Preconditions:
- Make sure non-ASM voting files are used
- Make sure no ASM OCR files are used
- Make sure at least one normal-redundancy ASM diskgroup with three failgroups is created and its compatible.asm attribute is set to 11.2
Steps:
1- Make sure the CRS stack is running on all nodes.
2- Run `crsctl query css votedisk` to check the configured voting files;
3- Run `crsctl replace votedisk +{ASM_DG_NAME}` (as the crs user or the root user);
4- Run `crsctl query css votedisk` to get the new voting file list;
5- Run `ocrconfig -add +{ASM_DG_NAME}` as the root user;
6- Run `ocrcheck` to verify the OCR files;
7- Restart the CRS stack and then verify the VF/OCR after it comes back.
A hedged sketch of this migration sequence follows the variants.
Variants:
1. Add up to 5 OCR files and restart the CRS stack;
2. Try to migrate the voting files from ASM back to non-ASM files and then restart the CRS stack.
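
A minimal sketch of the migration, assuming an eligible diskgroup named +OCRVF (illustrative); steps 3-4 run as the grid owner or root, and steps 5-6 as root.

```bash
#!/bin/bash
# Migrate voting files and add an OCR location into an ASM diskgroup (11gR2).
crsctl query css votedisk              # current, non-ASM voting files
crsctl replace votedisk +OCRVF         # move the voting files into the ASM diskgroup
crsctl query css votedisk              # new voting file list, now ASM-backed
ocrconfig -add +OCRVF                  # add an OCR location in the diskgroup (root)
ocrcheck                               # verify the OCR locations and their integrity
```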
Test 2 - 11gR2-Case-2
Action/Target: crsctl commands to manage the Oracle Clusterware stack
Test Category: Sanity Check
Preconditions:
- The CRS stack is up and running on all nodes.
Steps:
1- Run `crsctl check cluster -all` to get the stack status on all cluster nodes. Make sure the stack status of all cluster nodes is correct;
2- Run `crsctl stop cluster -all` to stop all CRS resources (CSSD/CRSD/EVMD) together with the application resources;
3- Run `crsctl status cluster -all` to make sure the CRS resources are OFFLINE;
4- Run `crsctl start cluster -all` to bring back the whole cluster stack.
A hedged sketch of this sequence follows.
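
A minimal sketch of the cluster-wide stack cycle, run as root from any node; the `ps` check is a local convention for confirming the daemons are gone on each node.

```bash
#!/bin/bash
# Check, stop, verify and restart the Clusterware stack cluster-wide (11gR2).
crsctl check cluster -all                   # stack status on every cluster node
crsctl stop cluster -all                    # stop CSSD/CRSD/EVMD plus application resources
ps -ef | grep -E '[o]cssd|[c]rsd|[e]vmd'    # repeat on each node; should print nothing
crsctl start cluster -all                   # bring the whole cluster stack back
crsctl check cluster -all
```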
Test 3 - 11gR2-Case-3
Action/Target: 11gR2 new feature; the OCR is stored in an ASM diskgroup
Test Category: Sanity Check
Preconditions:
- Initiate workloads
Steps:
1- Make sure only ASM OCR files are used (check the configuration with `ocrcheck`);
2- Kill the ASM pmon process on the OCR master node (see the sketch after the variants).
Variants:
Repeat the same test on a non-OCR-master node.
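
A minimal sketch of the fault injection in Step 2, run as root on the OCR master node; the +ASM1 instance name shown in the comment is illustrative.

```bash
#!/bin/bash
# Kill the ASM pmon process to bring the local ASM instance down hard.
ps -ef | grep '[a]sm_pmon'                        # e.g. shows asm_pmon_+ASM1 (name varies per node)
PMON_PID=$(ps -ef | awk '/[a]sm_pmon/ {print $2}')
kill -9 "$PMON_PID"
# CRSD should fail (it depends on ASM for OCR I/O); ASM, CRSD and the RDBMS
# instances are then expected to restart automatically.
```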
Test 4 - 11gR2-Case-4
Action/Target: 11.2.0.2 new feature - Redundant Interconnect Usage (HAIP)
Test Category: Sanity Check
Preconditions:
- During Clusterware installation, configure 2 or more (up to 4) private NICs as the cluster interconnect. For example, assume there are two private cluster interconnects, P1 and P2.
- Initiate all workloads (especially those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, plus OS stress)
- Identify the vendor, CSS and CRS master nodes
Steps:
1- Physically remove one private network cable (P1) from the CSS master/the vendor clusterware master
2- Wait 600 seconds
3- Restore network cable P1
4- Remove the other network cable (P2) from the CSS master
5- Wait 600 seconds
6- Restore network cable P2
Note:
Use `ifconfig` and `oifcfg getif -global` to save the output before/after the fault injection (a sketch follows).
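
A minimal sketch of the before/after capture requested in the note; the output file names are illustrative.

```bash
#!/bin/bash
# Record the interface and cluster-interconnect configuration around the fault.
ifconfig -a          > ifconfig.before      # includes the 169.254.x.x HAIP addresses
oifcfg getif -global > oifcfg.before
echo "Inject the P1/P2 cable faults per the steps, then press Enter." && read -r
ifconfig -a          > ifconfig.after
oifcfg getif -global > oifcfg.after
```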
Test 5 - 11gR2-Case-5
Action/Target: 11.2.0.2 new feature - Redundant Interconnect Usage (HAIP)
Test Category: Sanity Check
Preconditions:
- During Clusterware installation, configure 2 or more (up to 4) private NICs as the cluster interconnect.
- Initiate all workloads (especially those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, plus OS stress)
- Identify the vendor, CSS and CRS master nodes
Steps:
1- Physically remove all configured interfaces from the CSS master/the vendor clusterware master
2- Re-attach the network cables after the CSS master is evicted (by either Oracle Clusterware or the vendor clusterware, if present) and rebooted.
In 11.2.0.2, if the node does not reboot after CSSD terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach the network cables, then manually run `crsctl start crs` to start the CRS stack.
3- Wait until the former CSS master node rejoins the cluster
Note:
Use `ifconfig` and `oifcfg getif -global` to save the output before/after the fault injection.
Expected Test Outcomes

Test 1 (11gR2-Case-1):
RAC:
- In 11gR2, voting disks can be placed in an ASM diskgroup, and they are managed by the ASM instance when they reside there. This means individual voting files cannot be added/deleted while they are on ASM;
- In 11gR2, up to 5 OCR locations are supported.
Test 2 (11gR2-Case-2):
RAC:
- After running `crsctl stop cluster -all`, make sure all ocssd/evmd/crsd processes are stopped on all cluster nodes (via `ps -ef`).
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 3 (11gR2-Case-3):
Clusterware:
- Because the OCR is stored in ASM, if ASM fails or is brought down, CRSD will fail because it depends on ASM for I/O.
- The ASM, CRSD and RDBMS instances will be restarted automatically.
- After the CRSD restart, the resource states should not change (CRSD should recover the resources' previous states).
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 4 (11gR2-Case-4):
- If one of the interfaces fails, the HAIP address moves to another one of the configured interfaces in the defined set.
Vendor Clusterware:
- Zero impact on all clusterware daemons
ASM and RAC:
- Zero impact on the stability of all RAC hosts
- Zero node evictions or cluster failures
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
Test 5 (11gR2-Case-5):
Oracle Clusterware:
- When there is no vendor clusterware heartbeat, the customer must wait for MISSCOUNT to expire (see misscount tuning).
RAC:
- Zero impact on the stability of surviving RAC hosts.
- Uninterrupted cluster-wide I/O operations.
- No report of complete cluster failures/reboots.
- For 11.2.0.2, if all CRS resources and ASM/RDBMS processes are cleaned up prior to CSSD terminating, the node will not reboot after CSSD terminates. Otherwise, the node will still reboot.
- Oracle Clusterware resources managed by the evicted node either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.
- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.
- For 11gR2, collect `crsctl stat res -t` in a 60-second loop from the beginning to the end of the run; attach the output for auditing.
