Troubleshooting ServeRAID II Subsystems in a Cluster Environment


Following is a series of common problems and solutions that can help you troubleshoot your high-availability solution.


Problem: The ServeRAID Administration Utility program shows physical devices in the DDD state.

Action:

  1.  Enable the shared disk display function to allow disks that have been moved or have failed over to the other node in the cluster to be displayed in the RSV state instead of the DDD (defunct) state.

    Note: It is normal for disks that have been moved or have failed over to be displayed in the DDD state if the shared disk display feature has not been enabled.
     In this case, the disks shown in the DDD state are not really defective.

  2.  Check RAID level-1 and RAID level-5 arrays to make sure they are not in critical state.
     If they are in critical state, replace the failed disk and perform a rebuild operation (a command-line sketch follows this list).
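
If your configuration includes the IPSSEND command-line utility (shipped on later ServeRAID support CDs), the same checks can be made from a command prompt. The following is a minimal sketch only: IPSSEND command names and argument order vary by utility version, and the controller, channel, and SCSI ID values shown are placeholders, so verify the syntax against the help text of your IPSSEND version before running it.

      REM Sketch only -- verify against your IPSSEND version's help text.
      REM Controller 1, channel 1, SCSI ID 0 are placeholder values.

      REM List the adapter, logical-drive, and device configuration to see
      REM which disks are DDD or RSV and whether an array is critical:
      IPSSEND GETCONFIG 1 AL

      REM After replacing the failed disk, start a rebuild onto it:
      IPSSEND REBUILD 1 1 0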


Problem: ServeRAID shared logical drives do not fail over properly.

Action:

  1.  Ensure that the resource type of each ServeRAID adapter shared disk resource is IBM ServeRAID logical disk.

     If the resource type is shown as physical disk, the localquorum option was not specified properly when MSCS was installed.

     To correct this problem, you must reinstall the high-availability cluster solution using Microsoft Windows NT.
     Refer to Chapter 3, 'Installing the ServeRAID II Adapter for a High-Availability Cluster Solution Using Windows NT' on page 7 for instructions.

  2.  Ensure that shared SCSI buses on the ServeRAID adapter pair are connected so that corresponding SCSI channels are connected (for example, SCSI channel 1 on the adapter in the first cluster server is connected to SCSI channel 1 on the adapter in the second cluster server, channel 2 is connected to channel 2, and so forth).

  3.  Ensure that physical SCSI disks that contain logical drives are all connected to shared SCSI channels.
  4.  Ensure that there are no more than eight shared logical disk drives defined per pair of ServeRAID II adapters for use in your cluster.
  5.  For Windows NT Server clusters, ensure that SCSI channel 3 of the adapter pair that attaches to the ServeRAID logical drive that has been designated as the Windows NT Cluster Quorum Resource is used for arbitration.
     Also, ensure that it is connected from the first cluster server to SCSI channel 3 in the second cluster server and that there are no SCSI devices connected to that channel.

     The SCSI heartbeat connection must be connected to the third channel of the ServeRAID adapter pair that has the quorum drive connected to it. No disks can be installed on this heartbeat channel.
     If you choose to move the quorum drive to another ServeRAID II adapter, you must also move the SCSI heartbeat cable on both servers to the new quorum ServeRAID adapter pair.
     For more information, see 'ServeRAID II Considerations'.

  6.  The quorum disk can be placed on any ServeRAID channel that is shared by the cluster servers.
  7.  Make sure each shared logical drive has a Merge ID assigned.
     Merge IDs must be in the range of 1 to 8.

  8.  Make sure each ServeRAID II adapter has been assigned a unique Host ID and that each ServeRAID II adapter has its cluster partner Host ID assigned properly to correspond to the ServeRAID II adapter in the other cluster server that is attached to the shared SCSI buses (see the sketch following this list).

  9.  Check for loose shared SCSI bus cables.
  10.  Ensure that SCSI repeater cards in Model 3518 or 3519 disk expansion enclosures are at the latest revision level.
  11.  Ensure that physical disks that are expected to be moved or to fail over show up in the RDY or RSV state on the server that is attempting to take over control of these disks.
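
Items 7 and 8 can also be verified from a command prompt on each server if the IPSSEND utility is available. The following is a sketch under that assumption; controller number 1 is a placeholder, and the exact output labels differ between IPSSEND versions.

      REM Sketch only -- output labels vary by IPSSEND version.
      REM Show adapter-level settings, including the Host ID and the
      REM cluster partner Host ID, for controller 1 on this server:
      IPSSEND GETCONFIG 1 AD

      REM Show the logical-drive configuration, including the Merge IDs
      REM assigned to the shared logical drives (valid range 1 to 8):
      IPSSEND GETCONFIG 1 LD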


Problem: RAID level-5 logical disks cannot be accessed by the operating system after a failover.

Action: Use the ServeRAID Administration Utility program to check the state of the logical disk drive to ensure that it is not blocked.
Using the utility program, select the logical disk drive and look for a Blocked state of Yes.
If the logical disk drive is blocked, make sure all physical disks that are part of the logical drive are in the ONL state. If all physical disks are not in the ONL state, a disk might have gone bad during a failover or during the resynchronization process after a failover.
Data integrity cannot be guaranteed in this case and the array has been blocked to prevent the possibility of incorrect data being read from the logical drive.

Reinitialize and synchronize the logical drive and restore the data from a backup source. Depending on the type of data contained on the logical drive and the availability of a recent backup copy, you can unblock the drive and continue normal operation, or replace and rebuild one or more DDD disks. However, if you do not reinitialize, synchronize, and restore the drive, be aware that some data on the disk drive could be lost or corrupted. A command-line sketch for checking and unblocking the drive follows.
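
If your version of the IPSSEND utility supports the UNBLOCK command, the blocked state can also be checked and cleared from a command prompt. The following is a sketch under that assumption; controller 1 and logical drive 1 are placeholder values, and the drive should be unblocked only after weighing the data-integrity warning above.

      REM Sketch only -- assumes an IPSSEND version that supports UNBLOCK.
      REM Controller 1 and logical drive 1 are placeholder values.

      REM Confirm the logical drive is blocked and check the state of its
      REM physical disks:
      IPSSEND GETCONFIG 1 LD

      REM Unblock the logical drive; restore from backup afterward if the
      REM data may have been corrupted:
      IPSSEND UNBLOCK 1 1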


Problem: If one of the cluster servers fails and the surviving server takes over the cluster resources, occasionally one or more of the IP address resources will stay in the online pending state for several minutes after moving over to the surviving server. After this, the resource will go to the failed state, and the following error message will be displayed in the surviving server's system log (as viewed with the Event Viewer).

For Example:

      Windows NT Event Log Message:
      Date:     ???      Event ID:  1069
      Time:     ???      Source:    ClusSvc
      User:     N/A      Type:      Error
      Computer: ???      Category:  (4)
      Description:
      Cluster resource 'ip address resource name' failed.


Action: No action is necessary to bring the resource online after the failover. After about three minutes MSCS will successfully reattempt to bring this resource online on the surviving server. However, the following workaround will reduce the time for the IP addresses to come online.

  1.  Using the Cluster Administrator, right-click the IP address resource that is exhibiting this problem.
     This will display a context-sensitive menu.
  2.  Select Properties in the context-sensitive menu.
     This will display the Properties dialog box for the IP address resource.

  3.  Select the General tab. This will display the general settings for the IP address resource.
  4.  Enable the Run this resource in a separate Resource Monitor option.
  5.  Choose OK.

    Note: Changes will take effect the next time the resource is brought online.

  6.  Locate the Pending timeout edit box, which is at the bottom of this dialog box. The value will be at the default of 180 seconds (unless it has been changed previously).
  7.  Change this Pending timeout to a lower value.
     A value in the range of 15 to 20 seconds will usually reduce the time for the IP addresses to come online after a failover to less than two minutes.
     Note that this does not prevent the Cluster resource 'ip address resource name' error message from appearing; however, the message will appear sooner, so the retry occurs earlier and the resource comes online much faster. A command-line sketch of this workaround follows this list.
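
The same workaround can be applied with the CLUSTER.EXE command-line tool installed with MSCS, assuming your installation includes it. This is a sketch only: "My IP Address" is a placeholder resource name, and on the systems we have seen the PendingTimeout property is stored in milliseconds (180000 = 180 seconds), so confirm the units by listing the properties first.

      REM Sketch only -- "My IP Address" is a placeholder resource name.

      REM List the current property values (note the PendingTimeout units):
      CLUSTER RESOURCE "My IP Address" /PROP

      REM Run the resource in its own Resource Monitor and lower the
      REM pending timeout to about 20 seconds:
      CLUSTER RESOURCE "My IP Address" /PROP SeparateMonitor=1
      CLUSTER RESOURCE "My IP Address" /PROP PendingTimeout=20000

As with the Cluster Administrator procedure, the changes take effect the next time the resource is brought online.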


Problem: After one of the cluster servers has been shut down normally and the surviving server takes over the cluster resources, occasionally one or more of the IBM ServeRAID logical disk resources will stay in the online pending state for several minutes after moving over to the surviving server (when viewed with the Cluster Administrator).
After this, the resource will go to the failed state, and the following error message will be displayed in the surviving server's system log (as viewed with the Event Viewer).


For Example:

      Windows NT Event Log Message:
      Date:     ???      Event ID:  1069
      Time:     ???      Source:    ClusSvc
      User:     N/A      Type:      Error
      Computer: ???      Category:  (4)
      Description:
      Cluster resource 'IBM ServeRAID Logical Disk name' failed.


Action: No action is necessary to bring the resource online after the failover.
MSCS will successfully reattempt to bring this resource online on the surviving server within about four minutes.


Problem: You cannot reinstall the ServeRAID Windows NT Cluster Solution.
If a previous version of the IBM ServeRAID Cluster Solution has been uninstalled, a message incorrectly appears when you attempt to reinstall the IBM ServeRAID Windows NT Cluster Solution, asking if you want to perform an upgrade.


Action: You must delete the C3E76E53-F841-11D0-BFA1-08005AB8ED05 registry key.
To delete the registry key, do the following (a command-line alternative is sketched after this list):

  1.  Select Run.
  2.  Type REGEDIT and click OK. The Registry Editor window appears.
  3.  Select HKEY_CLASSES_ROOT\CLSID and delete C3E76E53-F841-11D0-BFA1-08005AB8ED05.
  4.  Reinstall the ServeRAID Windows NT Cluster Solution.
     Refer to Chapter 3, 'Installing the ServeRAID II Adapter for a High-Availability Cluster Solution Using Windows NT' for instructions.
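
On systems where the REG.EXE tool is available (it is built into Windows 2000 and later and is supplied with the Windows NT Resource Kit), the key can also be removed from a command prompt instead of through the Registry Editor. This is a sketch under that assumption; on some systems the CLSID entry appears with surrounding braces, so match the key name to what the Registry Editor actually shows, and be aware that flag support differs between REG.EXE versions.

      REM Sketch only -- assumes REG.EXE is available; adjust the key name
      REM (with or without braces) to match what the Registry Editor shows.
      REM The /F flag suppresses the confirmation prompt on versions that
      REM support it.
      REG DELETE "HKEY_CLASSES_ROOT\CLSID\C3E76E53-F841-11D0-BFA1-08005AB8ED05" /F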


Problem: The following warning message is displayed when running the IPSHAHTO program on a server: "Warning: CONFIG_SYNC with 0xA0 command FAILED on Adapter #". This can occur when one or more HSP/SHS devices are defined on an adapter pair, or when READY (RDY) devices that are not part of any logical array configuration are present on an adapter pair in the cluster.

Action: If all shared disk resources moved successfully when running the IPSHAHTO program, it is safe to ignore the error message and no further action is required.

If shared disk resources fail to move when running the IPSHAHTO program, perform a low-level format on all HSP/SHS and RDY devices that are not part of any logical array configuration on an adapter pair in the cluster (a command-line sketch follows).
Refer to the instructions on low-level formatting HSP/SHS drives in the ServeRAID Adapter Installation and User's Guide (P/N 4227022) for further details.
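
If the IPSSEND utility is available, the low-level format can be started from a command prompt. The following is a sketch only: the FORMAT arguments (controller, channel, SCSI ID) shown are placeholders, the command erases the target drive completely, and the exact syntax should be verified against your IPSSEND version before use.

      REM Sketch only -- controller 1, channel 2, SCSI ID 3 are placeholders.
      REM A low-level format erases the drive, so first confirm the device is
      REM not part of any logical array configuration:
      IPSSEND GETCONFIG 1 PD

      REM Low-level format the HSP/SHS or RDY device:
      IPSSEND FORMAT 1 2 3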


Problem: Array identifiers and logical drive numbers might change during a failover condition.


Action: By design, the array identifiers and logical drive numbers may change during a failover condition.
Consistency between the merge identifiers and Windows NT sticky drive letters is maintained, while the ordering process during a failover condition is controlled by the Microsoft Cluster Management Software and the available array identifiers on the surviving server.

