EXECUTIVE OVERVIEW


As more companies move to LAN-based client/server models, IBM servers and their associated drive subsystems are growing larger and storing more mission-critical information. As a result, the availability of these systems is increasingly important, and protecting the stored data is vital. This manual focuses on the actions necessary to properly maintain a RAID disk array and on how to recover from the most common types of failures in RAID disk arrays.

IBM provides management software, NetFinity Manager, to monitor the status of the hardware and provide alerts when conditions are not optimal. The software and its upgrades are available at no additional charge to all customers who have purchased an IBM server that ships with ServerGuide, so that customers can obtain all of the information necessary to confidently protect their data. Installing NetFinity Manager, or a similar tool, to monitor and track the health of the disk subsystem is critical to protecting the stored data. Without these tools, the failures listed below and other system warnings, such as Predictive Failure Analysis or SMART alerts, cannot be communicated to the operator in time for preventive action to be taken.

There are three types of drive failures that can typically occur in a RAID-5 or RAID-1 subsystem that may threaten the protection of this data:


'Catastrophic' Drive Failures 

When the data on a drive is completely inaccessible due to mechanical or electrical problems, we define this as a catastrophic or complete drive failure. In these cases, all data stored on the drive, including the ECC data written on the drive to protect information, is inaccessible. This is the failure that RAID-1 and RAID-5 arrays are most commonly used to protect against. A RAID-5 or RAID-1 array stores redundant information within the array of drives: 'parity' information in a RAID-5 array, and a mirrored copy of the data in a RAID-1 array. This redundant information can be used to recreate the data from the lost drive. The information is recalculated 'on the fly' in response to user requests and can also be used to rebuild the lost drive's data, either immediately to a hot spare drive or when the failed drive has been replaced. RAID-1 and RAID-5 arrays protect against the loss of a single drive within the array. Failure of more than one drive will require restoring information from a backup device.
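The way parity recreates a lost drive's data can be illustrated with a minimal sketch. It assumes the conventional RAID-5 scheme in which the parity block is the byte-wise XOR of the data blocks in a stripe. The block contents and the helper name xor_blocks are made up for illustration and are not taken from any IBM adapter.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks from one stripe, plus their parity block.
data = [b"ACCT", b"2024", b"IBM!"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]: XOR of the surviving
# data blocks and the parity block yields the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, the same routine serves both to generate parity and to rebuild any single missing block. It cannot recover two missing blocks from the same stripe.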

Problem: The RAID-5 technology cannot reconstruct the data correctly unless the RAID-5 parity throughout the drives is correct. The RAID-1 logical drive does not reconstruct using parity information. Therefore, RAID-1 logical drives are not affected.

Prevention: For the IBM ServeRAID and ServeRAID II Adapters, Synchronization is required before installing an operating system or storing any customer data on a RAID-5 array, to ensure that the parity correctly reflects the data. RAID-5 arrays write data out to the drives in stripe units. The size of the stripe unit can be configured to 8KB, 16KB, 32KB, or 64KB. Synchronization reads all the data bits in each stripe unit, calculates the parity for that data, compares the calculated parity with the existing parity for all stripe units in the array, and updates the existing parity for all stripe units that are inconsistent. Once the logical drive has been synchronized, the RAID-5 parity will remain synchronized until it is redefined.
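The read, calculate, compare, and update cycle of Synchronization can be sketched as follows. This is an illustrative model only: it assumes XOR parity, and the stripe layout (a dict with "data" and "parity" keys) and the function names are hypothetical, not the adapter's actual data structures.

```python
from functools import reduce

def parity_of(units):
    """XOR parity over the data stripe units of one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*units))

def synchronize(stripes):
    """Recompute parity for every stripe and update the inconsistent ones.

    `stripes` is a list of dicts with hypothetical keys "data" (list of
    bytes) and "parity" (bytes). Returns the number of stripes updated.
    """
    updated = 0
    for stripe in stripes:
        expected = parity_of(stripe["data"])
        if stripe["parity"] != expected:
            stripe["parity"] = expected
            updated += 1
    return updated

array = [
    {"data": [b"\x01\x02", b"\x04\x08"], "parity": b"\x05\x0a"},  # consistent
    {"data": [b"\xff\x00", b"\x0f\x0f"], "parity": b"\x00\x00"},  # stale parity
]
fixed = synchronize(array)
```

In this model a second pass over the same array finds nothing left to update, which mirrors the statement that parity stays synchronized once the logical drive has been synchronized.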


Grown Sector Media Errors

Sector media errors only affect a small area of the surface of the drive and do not constitute a catastrophic drive failure. These errors are typically identified when the corresponding data is requested by an application program. Often, the drive itself can repair these errors by recalculating lost data from Error Correction Code (ECC) information stored within each data sector on the drive. The drive then remaps this damaged sector to an unused area of the drive to prevent data loss.

Problem: Media Sector Errors may not be detected in seldom used files or in non-data areas of the disk. These errors will only be identified and corrected if a read or write request is made to data that is stored within that location.

Prevention: Data Scrubbing forces all sectors in the logical drive to be accessed so that Media Sector Errors are detected by the drive. Once detected, the drive's error recovery procedures are invoked to repair these errors by recalculating the lost data from the ECC information described above. If the ECC information is not sufficient to recalculate the lost data, the information may still be recovered if the drive is part of a RAID-5 or RAID-1 array. RAID-5 and RAID-1 arrays provide their own redundant information (similar to the ECC data written on the drive itself), which is stored on other drives in the array. The RAID adapter can recalculate the lost data and remap the bad sector.

Data Scrubbing can be performed in the background while allowing concurrent user disk activity on RAID-5 and RAID-1 logical drives. With the IBM ServeRAID II Adapter, Data Scrubbing is performed by the firmware of the adapter as a background process. With all other IBM RAID Adapters, an easy way to accomplish Data Scrubbing is Synchronization.

Netfinity Manager 5.0 allows you to automatically schedule the synchronization from either the server or the remote manager. Netfinity Manager 5.0 can be obtained at no additional charge by customers that have ServerGuide, which ships with every IBM server. If the customer has another type of scheduler, such as the AT scheduler built into Windows NT or REXXWARE by Simware Corporation, then the IBM ServeRAID and ServeRAID II adapter command line utilities may be used to schedule Data Scrubbing without Netfinity Manager installed. Refer to the Data Scrubbing Utilities Available via Array Synchronization section for the adapter and operating system compatibility matrix for these Data Scrubbing utilities.
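The scrubbing pass described under Prevention above can be modeled with a short sketch. It is a simplified illustration, not the adapter firmware: drives are plain lists of sectors, None stands for a sector the drive could not recover with its own ECC, parity is assumed to be XOR and is kept on one fixed drive for readability (real RAID-5 layouts distribute parity across drives), and all names are hypothetical.

```python
from functools import reduce

def xor_all(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def scrub(drives):
    """Access every sector on every drive; repair single unreadable sectors.

    `drives` is a list of equal-length lists of sectors (bytes or None).
    Returns (repaired, unrecoverable) counts of sector rows.
    """
    repaired = unrecoverable = 0
    for row in range(len(drives[0])):
        bad = [d for d in range(len(drives)) if drives[d][row] is None]
        if len(bad) == 1:
            good = [drv[row] for drv in drives if drv[row] is not None]
            # Recalculate the lost sector and write it back (remapped).
            drives[bad[0]][row] = xor_all(good)
            repaired += 1
        elif len(bad) > 1:
            unrecoverable += 1
    return repaired, unrecoverable

# Three-drive array; d2 holds the parity (d0 XOR d1) in this simplified layout.
d0 = [b"\x11", b"\x22", b"\x33"]
d1 = [b"\x44", None, b"\x66"]      # latent media error in a seldom-read sector
d2 = [b"\x55", b"\x77", b"\x55"]
result = scrub([d0, d1, d2])
```

The point of the model is that the latent error is found and repaired while all other drives are still healthy, which is the only time the redundant information is guaranteed to be complete.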


Combination Failures 

Problem: When a catastrophic drive failure occurs while there are still undetected and therefore uncorrected sector media errors on the remaining drives in the array, the array will not be able to rebuild all the data. Just as if two drives had failed, the array will be missing BOTH the information stored on the lost drive and the information from the sector where there is a media error. This constitutes a double failure and files will need to be restored from backup media.
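The double failure described above can be shown with a small model of a rebuild. It assumes XOR parity and uses None to mark an unreadable sector on a surviving drive. The two-drive-plus-parity layout and the function names are hypothetical and chosen only to keep the example short.

```python
from functools import reduce

def xor_all(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(surviving_drives):
    """Rebuild a failed drive row by row from the surviving drives.

    Rows in which a surviving drive has an unreadable sector (None)
    cannot be rebuilt and are returned as None.
    """
    rebuilt = []
    for row in zip(*surviving_drives):
        rebuilt.append(None if None in row else xor_all(row))
    return rebuilt

# The failed drive held [0x44, 0x55, 0x66]; these are the survivors.
d0 = [b"\x11", b"\x22", b"\x33"]
parity = [b"\x55", b"\x77", b"\x55"]
clean = rebuild([d0, parity])

# Same rebuild, but d0 has an undetected media error in its second sector.
d0_with_latent_error = [b"\x11", None, b"\x33"]
damaged = rebuild([d0_with_latent_error, parity])
```

In the second case the row with the media error is lost on both the failed drive and the surviving drive, so that data has to come from backup media.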

Prevention: IBM recommends Data Scrubbing all RAID-5 and RAID-1 logical drives weekly to minimize the risk of having any undetected sector media errors on the remaining drives of the array when a drive failure occurs.

