Troubleshooting ServeRAID Subsystems in a Cluster Environment

The following common problems and solutions can help you troubleshoot your high-availability solution.


Problem: The ServeRAID Administration Utility program shows physical devices in the defunct (DDD) state.
Action:

  1.  Enable the shared disk display function so that disks that have been moved, or have failed over to the other node in the cluster, are displayed in the RSV state instead of the DDD state.

     Note: It is normal for disks that have been moved or have failed over to be displayed in the DDD state if the shared disk display feature has not been enabled. In this case, the disks shown in the DDD state are not actually defective.

  2.  Check RAID level 1 and RAID level 5 arrays to make sure they are not in the critical state.
     If an array is in the critical state, replace the failed disk and perform a rebuild operation. (A command-line status check is sketched below.)
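
     If you prefer to check array state from a command prompt, the following is a minimal Python sketch that shells out to the IPSSEND utility shipped with ServeRAID and scans its output for arrays in the critical state. The exact IPSSEND arguments and output format vary by version and are assumptions here; consult the ServeRAID Adapter Installation and User's Guide for the syntax that matches your release.

     # Minimal sketch: flag ServeRAID arrays reported in the critical (CRT) state.
     # Assumptions: the IPSSEND utility is on the PATH, "getconfig 1 al" is valid
     # for your IPSSEND version, and the keyword "CRT" appears in its plain-text
     # output for critical arrays. Verify both against your ServeRAID documentation.
     import subprocess

     def critical_arrays(controller=1):
         out = subprocess.run(
             ["ipssend", "getconfig", str(controller), "al"],
             capture_output=True, text=True, check=True,
         ).stdout
         # Collect any configuration lines that mention the critical state.
         return [line.strip() for line in out.splitlines() if "CRT" in line]

     if __name__ == "__main__":
         for line in critical_arrays():
             print("Critical:", line)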


Problem: ServeRAID shared logical drives do not failover properly.

Action:

  1.  Ensure that the resource type of each ServeRAID adapter shared disk resource is "IBM ServeRAID logical disk".

     If the resource type is shown as "physical disk", the localquorum option was not specified properly when MSCS was installed.

     To correct this problem, you must reinstall the high-availability cluster solution using Microsoft Windows NT.
     Refer to Chapter 4, 'Installing a High-Availability Cluster Solution Using Windows NT' for instructions.

  2.  Ensure that the shared SCSI buses on the ServeRAID adapter pair are connected so that corresponding SCSI channels match
     (for example, SCSI channel 1 on the adapter in the first cluster node server is
     connected to SCSI channel 1 on the adapter in the second cluster node server,
     channel 2 is connected to channel 2, and so on).

  3.  Ensure that physical SCSI disks that contain logical drives are all connected to shared SCSI channels.
  4.  Ensure that no more than 8 shared logical disk drives are defined per pair of ServeRAID adapters for use in your cluster.
  5.  For Windows NT Server clusters, ensure that SCSI channel 3 of the adapter pair attached to the ServeRAID logical drive designated as the NT Cluster Quorum Resource is used for arbitration.
     Also ensure that channel 3 on the first cluster node is connected to SCSI channel 3 on the second cluster node, and that no SCSI devices are connected to that channel.

     The SCSI heartbeat connection must be connected to the third channel of the ServeRAID adapter pair that has the quorum drive connected to it.
     No disks can be installed on this heartbeat channel.
     If you choose to move the quorum drive to another ServeRAID adapter, you must also move the SCSI heartbeat cable on both servers to the new quorum ServeRAID adapter pair. For more information, see 'ServeRAID Considerations'.

  6.  The quorum disk can be placed on any ServeRAID channel shared by the two cluster node servers.
  7.  Make sure each shared logical drive has a Merge ID assigned.
     Merge IDs must be in the range 1 to 8. (A configuration-check sketch follows this list.)
  8.  Make sure each ServeRAID adapter has been assigned a unique Host ID, and that each adapter's cluster partner Host ID is assigned properly to correspond to the ServeRAID adapter in the other cluster node server that is attached to the shared SCSI buses.

  9.  Check for loose shared SCSI bus cables.
  10.  Ensure that SCSI repeater cards in Model 3518 or 3519 disk expansion enclosures are at the latest revision level.
  11.  Ensure that physical disks that are expected to move or fail over appear in the RDY or RSV state on the node that is attempting to take over control of those disks.
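
     Several of the checks above (matching channels, the limit of 8 shared logical drives, and Merge IDs in the range 1 to 8) can be validated from a hand-entered description of your configuration. The following Python sketch is a minimal illustration under that assumption; the data structure is hypothetical and must be filled in from the ServeRAID Administration Utility, it is not read from the adapters.

     # Minimal sketch: sanity-check a hand-entered shared-drive configuration
     # against the rules above. The "config" structure is hypothetical; fill it
     # in from the ServeRAID Administration Utility.
     config = {
         "shared_logical_drives": [
             # (merge_id, description)
             (1, "quorum"),
             (2, "data1"),
         ],
         "channel_links": [
             # (node1_channel, node2_channel) for each shared SCSI cable
             (1, 1),
             (2, 2),
         ],
     }

     merge_ids = [m for m, _ in config["shared_logical_drives"]]

     assert len(merge_ids) <= 8, "no more than 8 shared logical drives per adapter pair"
     assert all(1 <= m <= 8 for m in merge_ids), "Merge IDs must be in the range 1 to 8"
     assert len(set(merge_ids)) == len(merge_ids), "each shared logical drive needs its own Merge ID"
     assert all(a == b for a, b in config["channel_links"]), \
         "corresponding SCSI channels must be cabled together (1-1, 2-2, and so on)"
     print("Configuration checks passed.")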


Problem: RAID level 5 logical disks cannot be accessed by the operating system after a failover.

Action: Use the ServeRAID Administration Utility program to check the state of the logical disk drive and ensure that it is not blocked. In the utility program, select the logical disk drive and check whether its Blocked state is Yes.
If the logical disk drive is blocked, make sure all physical disks that are part of the logical drive are in the ONL state. If they are not, a disk might have failed during the failover or during the resynchronization process after the failover.
Data integrity cannot be guaranteed in this case, and the array has been blocked to prevent incorrect data from being read from the logical drive.

Reinitialize and synchronize the logical drive, and restore the data from a backup source.
Depending on the type of data contained on the logical drive and the availability of a recent backup copy, you can instead unblock the drive and continue normal operation, or replace and rebuild one or more DDD disks. This decision is summarized in the sketch below.
However, if you do not reinitialize, synchronize, and restore the drive, be aware that some data on the disk drive could be lost or corrupted.
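
For illustration only, the recovery decision above can be written out as a few lines of Python. The function below is hypothetical, not part of any ServeRAID utility; the state names ("ONL", "DDD") are the ones used in this section.

     # Minimal sketch of the recovery decision described above. Hypothetical
     # helper, not part of any ServeRAID utility; states come from the
     # ServeRAID Administration Utility ("ONL", "DDD", and so on).
     def recovery_advice(blocked, physical_states, recent_backup):
         if not blocked:
             return "Logical drive is not blocked; no recovery action needed."
         if any(state != "ONL" for state in physical_states):
             return ("Replace and rebuild the DDD disk(s), then reinitialize, "
                     "synchronize, and restore the logical drive from backup.")
         if recent_backup:
             return ("Safest course: reinitialize, synchronize, and restore from "
                     "backup. Unblocking instead risks reading corrupted data.")
         return ("Unblock and continue normal operation only if the data can be "
                 "re-validated; integrity is not guaranteed after a blocked failover.")

     print(recovery_advice(True, ["ONL", "DDD", "ONL"], recent_backup=True))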


Problem: If one of the cluster nodes fails and the surviving node takes over the cluster resources, occasionally one or more of the IP address resources stays in the 'online pending' state for several minutes after moving to the surviving node. After this, the resource goes to the failed state, and the following error message is displayed in the surviving node's system log (as viewed with the Event Viewer).

     Example NT Event Log message:

     Date: ???       Event ID: 1069
     Time: ???       Source: ClusSvc
     User: N/A       Type: Error
     Computer: ???   Category: (4)
     Description: Cluster resource 'ip address resource name' failed


Action: Do the following:

  1.  Using the right mouse button, click the IP address resource in the Cluster Administrator.
  2.  Select Properties, and then select the General tab.
  3.  Check the box labeled "Run this resource in a separate Resource Monitor".

     A message appears stating that the resource must be restarted for the change to take effect.
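
     The same setting is exposed by the CLUSTER.EXE command-line utility that ships with MSCS as the common resource property SeparateMonitor. The following Python sketch drives that utility; the resource name is a placeholder, and the property name and syntax should be verified against your MSCS documentation.

     # Minimal sketch: set the SeparateMonitor common property on an IP address
     # resource via CLUSTER.EXE, then cycle the resource so the change takes
     # effect. "IP Address Resource" is a placeholder for your resource name.
     import subprocess

     RESOURCE = "IP Address Resource"  # placeholder name

     for args in (
         ["cluster", "resource", RESOURCE, "/prop", "SeparateMonitor=1"],
         ["cluster", "resource", RESOURCE, "/offline"],
         ["cluster", "resource", RESOURCE, "/online"],
     ):
         subprocess.run(args, check=True)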


Problem: After one of the cluster nodes (servers) has been shut down normally and the surviving node takes over the cluster resources, occasionally one or more of the IBM ServeRAID logical disk resources stays in the 'online pending' state for several minutes after moving to the surviving node (when viewed with the Cluster Administrator).
After this, the resource goes to the 'failed' state, and the following error message is displayed in the surviving node's system log (as viewed with the Event Viewer).

     Example NT Event Log message:

     Date: ???       Event ID: 1069
     Time: ???       Source: ClusSvc
     User: N/A       Type: Error
     Computer: ???   Category: (4)
     Description: Cluster resource 'IBM ServeRAID Logical Disk name' failed


Action: No action is necessary to bring the resource online after the failover.
MSCS will successfully re-attempt to bring this resource online on the surviving node within about four minutes.


Problem: You cannot reinstall the ServeRAID Windows NT Cluster Solution.
If a previous version of the IBM ServeRAID Cluster Solution has been uninstalled, a message incorrectly appears asking if you want to perform an upgrade when you attempt to reinstall the IBM ServeRAID Windows NT Cluster Solution.


Action: You must delete the C3E76E53-F841-11D0-BFA1-08005AB8ED05 registry key.

To delete the registry key, do the following:

  1.  Click Start and select Run.
  2.  Type REGEDIT and click OK. The Registry Editor window appears.
  3.  Select "HKEY_CLASSES_ROOT\CLSID" and delete the C3E76E53-F841-11D0-BFA1-08005AB8ED05 key. (A scripted alternative is sketched after this list.)
  4.  Reinstall the ServeRAID Windows NT Cluster Solution.
     Refer to Chapter 4, 'Installing a High-Availability Cluster Solution Using Windows NT' for instructions.
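
     If you prefer to script the deletion in steps 1 through 3, the following is a minimal Python sketch using the standard winreg module. It assumes the key name exactly as shown above; on some systems CLSID entries carry surrounding braces, so confirm the exact name in the Registry Editor first. Administrator rights are required.

     # Minimal sketch: remove the leftover ServeRAID Cluster Solution CLSID key
     # so that the reinstall no longer offers an upgrade. Confirm the exact key
     # name in REGEDIT first; some systems store CLSID entries with braces.
     import winreg

     def delete_key_tree(root, path):
         """Recursively delete a registry key and all of its subkeys."""
         with winreg.OpenKey(root, path) as key:
             while True:
                 try:
                     child = winreg.EnumKey(key, 0)
                 except OSError:
                     break  # no more subkeys
                 delete_key_tree(root, path + "\\" + child)
         # winreg.DeleteKey can only remove a key once it has no subkeys.
         winreg.DeleteKey(root, path)

     delete_key_tree(winreg.HKEY_CLASSES_ROOT,
                     r"CLSID\C3E76E53-F841-11D0-BFA1-08005AB8ED05")
     print("Registry key deleted.")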


Problem: The following warning message is displayed when running the IPSHAHTO program on a node:
"Warning: CONFIG_SYNC with 0xA0 command FAILED on Adapter #"
This occurs when one or more HSP/SHS devices are defined on an adapter pair, or when READY (RDY) devices that are not part of any logical array configuration exist on an adapter pair in a cluster.

Action: If all shared disk resources moved successfully when running the IPSHAHTO program, it is safe to ignore the error message and no further action is required.

If shared disk resources fail to move when running the IPSHAHTO program, perform a low-level format on all HSP/SHS and RDY devices that are not part of any logical array configuration on an adapter pair in the cluster.
Refer to the instructions on low-level formatting HSP/SHS drives in the ServeRAID Adapter Installation and User's Guide (P/N 4227022) for further details.


Problem: Array identifiers and logical drive numbers might change during a failover condition.

Action: By design, the array identifiers and logical drive numbers may change during a failover condition.
Consistency between the merge identifiers and Windows NT sticky drive letters is maintained; the ordering process during a failover condition is controlled by the Microsoft Cluster Management software and the available array identifiers on the surviving node.

