Clustering - Cluster Failover Testing

Applicability:

SQL Server 2000: Tested

SQL Server 2005: Tested

SQL Server 2008: Not Tested

SQL Server 2008R2: Not Tested

SQL Server 2012: Not Tested

Credits:

Author: Unknown

Date: 28 Aug 2008

Description

Not really sure where this list came from, but it certainly encompasses all of the things I'd recommend not taking for granted when implementing a cluster. Most of these failures are highly unlikely to happen, but one can never be too safe. If a failure that was not tested does happen, it is a much easier position to defend if one can demonstrate that the scenarios below were tested.

The examples below assume a 2-node cluster, but are equally valid for a multi-instance, multi-node cluster.

For completeness, the failover tests should ideally be performed for all possible combinations; with a 3-node cluster, one should perform a failover check in each direction between each pair of nodes (6 checks). However, this is not always practical or possible.
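
The figure of 6 checks for 3 nodes is the number of ordered pairs of distinct nodes, n x (n - 1). As a minimal sketch (the node names are hypothetical placeholders), the directions can be listed like this:

```python
from itertools import permutations


def failover_checks(nodes):
    """Return every (source, target) failover direction between distinct nodes."""
    return list(permutations(nodes, 2))


# Node names are hypothetical; substitute the real cluster node names.
checks = failover_checks(["NODE1", "NODE2", "NODE3"])
for source, target in checks:
    print(f"Fail over from {source} to {target}")
print(len(checks), "checks")
```

With 4 nodes the same function yields 12 directions, which shows how quickly full coverage becomes impractical.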

Aside from the testing with the Cluster Administrator as outlined in TechNet, I have ALWAYS performed as many of the following tests as possible after clustering and SQL Server are configured and installed (NOTE - All tests start with both nodes up):

- Active Node Hardware Failure - On the 'active' node perform a shutdown with no restart - things should failover to the 'passive' node.
- Passive Node Hardware Failure - On the 'passive' node perform a shutdown with no restart - things should remain 'active' on the active node.
- Active Node Power Failure - On the 'active' node pull the power cord - things should failover to the 'passive' node. NOTE: If there are redundant power supplies, you will need to pull all power cords.
- Passive Node Power Failure - On the 'passive' node pull the power cord - things should remain 'active' on the active node.
- Active Node Network Failure - On the 'active' node pull the network cable out - things should failover to the 'passive' node.
- Passive Node Network Failure - On the 'passive' node pull the network cable - things should remain 'active' on the active node.
- Cluster Heartbeat Failure - Unplug the crossover cable - both nodes should remain up but the cluster administrator will complain.
- Active Node Fiber Failure - On the 'active' node pull the fiber cable out - things should failover to the 'passive' node. NOTE: If there are redundant fiber paths, you will need to pull all of them out.
- Passive Node Fiber Failure - On the 'passive' node pull the fiber cable - things should remain 'active' on the active node.
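
One way to keep a record of which tests were run is a simple checklist of expected outcomes. The sketch below is only an illustration: the outcome labels ("failover", "no failover", "nodes stay up") and the function name are assumptions made here, while the test names and expectations come from the list above.

```python
# Expected outcome per test, taken from the list above.
# The short outcome labels are an assumed convention for this sketch.
EXPECTED = {
    "Active Node Hardware Failure": "failover",
    "Passive Node Hardware Failure": "no failover",
    "Active Node Power Failure": "failover",
    "Passive Node Power Failure": "no failover",
    "Active Node Network Failure": "failover",
    "Passive Node Network Failure": "no failover",
    "Cluster Heartbeat Failure": "nodes stay up",
    "Active Node Fiber Failure": "failover",
    "Passive Node Fiber Failure": "no failover",
}


def unexpected_results(observed):
    """Return the tests whose observed outcome is missing or differs from the expected one."""
    return sorted(
        name for name, want in EXPECTED.items() if observed.get(name) != want
    )


# Example: only one test has been run so far, and it behaved as expected.
observed = {"Active Node Hardware Failure": "failover"}
for name in unexpected_results(observed):
    print("Still to run or investigate:", name)
```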

The 'stuff' above is what I prefer to call 'fun'. But you need to make it educational while you do it. To accomplish this, examine the System, Application and Security Event logs before and after each test on both nodes, and possibly even keep a labelled copy of the relevant sections. By doing this you will be one step ahead in diagnosing a real problem when it occurs. Also, do not forget to examine the actual cluster log located at: C:\Windows\Cluster\cluster.log. This file wraps like a transaction log, and its timestamps are in GMT.
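
Because the cluster log is in GMT while the Event logs are usually read in local time, it can help to convert timestamps when lining the two up. The following is a minimal sketch; the timestamp layout shown is an assumption, so check it against the actual cluster.log and adjust the format string if it differs.

```python
from datetime import datetime, timezone

# Assumed timestamp layout, e.g. "2008/08/28-14:22:31.123".
# This is an assumption; verify it against your own cluster.log.
ASSUMED_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def cluster_log_time_to_local(stamp, fmt=ASSUMED_FORMAT):
    """Interpret a cluster.log timestamp as GMT and return it in the machine's local time."""
    utc_time = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return utc_time.astimezone()


print(cluster_log_time_to_local("2008/08/28-14:22:31.123"))
```

The printed value depends on the time zone of the machine running the script.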

Code

N/A