Failures on RAID-5 Volumes

Failures are seen in two varieties: system failures and disk failures. A system failure means that the system has abruptly ceased to operate due to an operating system panic or power failure. Disk failures imply that the data on some number of disks has become unavailable due to a hardware failure (such as a head crash, an electronics failure on the disk, or a disk controller failure).

System Failures

RAID-5 volumes are designed to remain available, with a minimum of disk space overhead, when disk failures occur. However, many forms of RAID-5 can lose data after a system failure. Data loss occurs because a system failure leaves the data and parity in the RAID-5 volume unsynchronized: the status of writes that were outstanding at the time of the failure cannot be determined. If a loss of sync occurs while a RAID-5 volume is being accessed, the volume is described as having stale parity. The parity must then be reconstructed by reading all the non-parity columns within each stripe, recalculating the parity, and writing out the parity stripe unit in the stripe. This must be done for every stripe in the volume, so it can take a long time to complete.

Caution: While a RAID-5 volume without log plexes is being resynchronized, any failure of a disk within the volume causes its data to be lost.

Besides this vulnerability to failure, the resynchronization process can tax system resources and slow down system operation.

RAID-5 logs reduce the damage that can be caused by system failures because they maintain a copy of the data being written at the time of the failure. Resynchronization then consists of reading the data and parity from the logs and writing them to the appropriate areas of the RAID-5 volume. This greatly reduces the amount of time needed to resynchronize the data and parity. It also means that the volume never becomes truly stale: the data and parity for all stripes in the volume are known at all times, so the failure of a single disk cannot result in the loss of the data within the volume.

Disk Failures

An uncorrectable I/O error occurs when disk failure, cabling, or other problems cause the data on a disk to become unavailable. For a RAID-5 volume, this means that a subdisk becomes unavailable. The subdisk cannot be used to hold data and is considered stale and detached. If the underlying disk later becomes available or is replaced, the subdisk is still considered stale and is not used.

If an attempt is made to read data contained on a stale subdisk, the data is reconstructed from the data on all other stripe units in the stripe. This operation is called a reconstructing-read. It is more expensive than simply reading the data, and can result in degraded read performance. When a RAID-5 volume has stale subdisks, it is considered to be in degraded mode.

A RAID-5 volume in degraded mode can be recognized from the output of the vxprint -ht command, as shown in the following display:

V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
...
v  r5vol        -            ENABLED  DEGRADED 204800   RAID      -        raid5
pl r5vol-01     r5vol        ENABLED  ACTIVE   204800   RAID      3/16     RW
sd disk01-01    r5vol-01     disk01   0        102400   0/0       c2t9d0   ENA
sd disk02-01    r5vol-01     disk02   0        102400   1/0       c2t10d0  dS
sd disk03-01    r5vol-01     disk03   0        102400   2/0       c2t11d0  ENA
pl r5vol-02     r5vol        ENABLED  LOG      1440     CONCAT    -        RW
sd disk04-01    r5vol-02     disk04   0        1440     0         c2t12d0  ENA
pl r5vol-03     r5vol        ENABLED  LOG      1440     CONCAT    -        RW
sd disk05-01    r5vol-03     disk05   0        1440     0         c2t14d0  ENA

The volume r5vol is in degraded mode, as shown by the volume state, which is listed as DEGRADED. The failed subdisk is disk02-01, as shown by the MODE flags: d indicates that the subdisk is detached, and S indicates that the subdisk's contents are stale.

Note: Do not run the vxr5check command on a RAID-5 volume that is in degraded mode.

A disk containing a RAID-5 log plex can also fail. The failure of a single RAID-5 log plex has no direct effect on the operation of a volume, provided that the RAID-5 log is mirrored. However, loss of all RAID-5 log plexes in a volume makes it vulnerable to a complete failure. In the output of the vxprint -ht command, failure within a RAID-5 log plex is indicated by the plex state being shown as BADLOG rather than LOG. This is shown in the following display, where the RAID-5 log plex r5vol-02 has failed:

V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
...
v  r5vol        -            ENABLED  ACTIVE   204800   RAID      -        raid5
pl r5vol-01     r5vol        ENABLED  ACTIVE   204800   RAID      3/16     RW
sd disk01-01    r5vol-01     disk01   0        102400   0/0       c2t9d0   ENA
sd disk02-01    r5vol-01     disk02   0        102400   1/0       c2t10d0  ENA
sd disk03-01    r5vol-01     disk03   0        102400   2/0       c2t11d0  ENA
pl r5vol-02     r5vol        DISABLED BADLOG   1440     CONCAT    -        RW
sd disk04-01    r5vol-02     disk04   0        1440     0         c2t12d0  ENA
pl r5vol-03     r5vol        ENABLED  LOG      1440     CONCAT    -        RW
sd disk05-01    r5vol-03     disk05   0        1440     0         c2t14d0  ENA

Default Startup Recovery Process for RAID-5

VxVM may need to perform several operations to restore fully the contents of a RAID-5 volume and make it usable. Whenever a volume is started, any RAID-5 log plexes are zeroed before the volume is started. This prevents random data from being interpreted as a log entry and corrupting the volume contents. Also, some subdisks may need to be recovered, or the parity may need to be resynchronized (if RAID-5 logs have failed). VxVM takes the following steps when a RAID-5 volume is started:

- If valid RAID-5 log plexes exist, they are replayed to bring the data and parity back into synchronization.
- If no valid logs exist and the parity is stale, the parity is resynchronized.
- Any stale subdisks are recovered.
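Because RAID-5 log plexes both shorten resynchronization and prevent the parity from becoming stale, it is worth confirming that a volume has at least one log before a failure occurs. The following is a minimal sketch rather than a prescribed procedure: it reuses the disk group mydg and volume r5vol from the examples in this chapter, and assumes a hypothetical disk disk06 with free space (the exact vxassist storage attributes accepted may vary by release). To list the volume's plexes and check for any in the LOG state:

# vxprint -g mydg -ht r5vol

To add a RAID-5 log plex, allocated on disk06:

# vxassist -g mydg addlog r5vol disk06

Running the addlog operation a second time adds a second log plex, which keeps the log itself redundant.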
Recovering a RAID-5 Volume

The types of recovery that may typically be required for RAID-5 volumes are the following:

- Parity resynchronization
- Log plex recovery
- Stale subdisk recovery
Parity resynchronization and stale subdisk recovery are typically performed when the RAID-5 volume is started, or shortly after the system boots. They can also be performed by running the vxrecover command. For more information on starting RAID-5 volumes, see Starting RAID-5 Volumes.

If hot-relocation is enabled at the time of a disk failure, system administrator intervention is not required unless no suitable disk space is available for relocation. Hot-relocation is triggered by the failure, and the system administrator is notified of the failure by electronic mail. Hot-relocation automatically attempts to relocate the subdisks of a failing RAID-5 plex. After any relocation takes place, the hot-relocation daemon (vxrelocd) also initiates a parity resynchronization. In the case of a failing RAID-5 log plex, relocation occurs only if the log plex is mirrored; the vxrelocd daemon then initiates a mirror resynchronization to recreate the RAID-5 log plex. If hot-relocation is disabled at the time of a failure, the system administrator may need to initiate a resynchronization or recovery manually.

Note: Following severe hardware failure of several disks or other related subsystems underlying a RAID-5 plex, it may be impossible to recover the volume using the methods described in this chapter. In this case, remove the volume, recreate it on hardware that is functioning correctly, and restore the contents of the volume from a backup.

Parity Resynchronization

In most cases, a RAID-5 array does not have stale parity. Stale parity occurs only after all RAID-5 log plexes for the RAID-5 volume have failed, and then only if there is a system failure. Even if a RAID-5 volume has stale parity, it is usually repaired as part of the volume start process. If a volume without valid RAID-5 logs is started and the process is killed before the volume is resynchronized, the result is an active volume with stale parity. The following example shows the output of the vxprint -ht command for such a stale RAID-5 volume:

V  NAME         RVG/VSET/CO  KSTATE   STATE    LENGTH   READPOL   PREFPLEX UTYPE
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE   MODE
SV NAME         PLEX         VOLNAME  NVOLLAYR LENGTH   [COL/]OFF AM/NM    MODE
...
v  r5vol        -            ENABLED  NEEDSYNC 204800   RAID      -        raid5
pl r5vol-01     r5vol        ENABLED  ACTIVE   204800   RAID      3/16     RW
sd disk01-01    r5vol-01     disk01   0        102400   0/0       c2t9d0   ENA
sd disk02-01    r5vol-01     disk02   0        102400   1/0       c2t10d0  dS
sd disk03-01    r5vol-01     disk03   0        102400   2/0       c2t11d0  ENA
...

This output lists the volume state as NEEDSYNC, indicating that the parity needs to be resynchronized. The state could also have been SYNC, indicating that a synchronization was attempted at start time and that a synchronization process should be doing the synchronization. If no such process exists, or if the volume is in the NEEDSYNC state, a synchronization can be started manually by using the resync keyword for the vxvol command. For example, to resynchronize the RAID-5 volume shown in the figure Invalid RAID-5 Volume, use the following command:

# vxvol -g mydg resync r5vol

Parity is regenerated by issuing VOL_R5_RESYNC ioctls to the RAID-5 volume. The resynchronization process starts at the beginning of the RAID-5 volume and resynchronizes a region equal to the number of sectors specified by the -o iosize option. If the -o iosize option is not specified, the default maximum I/O size is used. The resync operation then moves on to the next region until the entire length of the RAID-5 volume has been resynchronized.
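For example, a resynchronization can be started with an explicit region size and its progress then watched. This is an illustrative sketch, not a prescribed procedure: it reuses the disk group mydg and volume r5vol from the examples above, and the region size of 2048 sectors is an arbitrary choice.

# vxvol -g mydg -o iosize=2048 resync r5vol

If the operation is registered as a task, its progress can be followed with the vxtask utility:

# vxtask list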
For larger volumes, parity regeneration can take a long time, and it is possible that the system could be shut down or crash before the operation completes. Unless the progress of parity regeneration is preserved across reboots, the process has to start over from the beginning after a system shutdown. To avoid such a restart, parity regeneration is checkpointed: the offset up to which the parity has been regenerated is saved in the configuration database. The -o checkpt=size option controls how often the checkpoint is saved. If the option is not specified, the default checkpoint size is used. Because saving the checkpoint offset requires a transaction, making the checkpoint size too small can extend the time required to regenerate parity. After a system reboot, a RAID-5 volume that has a checkpoint offset smaller than the volume length starts a parity resynchronization at the checkpoint offset.

Log Plex Recovery

RAID-5 log plexes can become detached due to disk failures. These RAID-5 logs can be reattached by using the att keyword for the vxplex command. To reattach a failed RAID-5 log plex, use the following command:

# vxplex -g mydg att r5vol r5vol-l1

Stale Subdisk Recovery

Stale subdisk recovery is usually done at volume start time. However, the process doing the recovery can crash, or the volume may be started with an option such as -o delayrecover that prevents subdisk recovery. In addition, the disk on which the subdisk resides can be replaced without any recovery operations being performed. In such cases, you can perform subdisk recovery by using the vxvol recover command. For example, to recover the stale subdisk in the RAID-5 volume shown in the figure Invalid RAID-5 Volume, use the following command:

# vxvol -g mydg recover r5vol disk05-00

A RAID-5 volume that has multiple stale subdisks can be recovered in one operation. To recover multiple stale subdisks, use the vxvol recover command on the volume, as follows:

# vxvol -g mydg recover r5vol

Recovery After Moving RAID-5 Subdisks

When RAID-5 subdisks are moved and replaced, the new subdisks are marked as STALE in anticipation of recovery. If the volume is active, the vxsd command may be used to recover the volume. If the volume is not active, it is recovered when it is next started. The RAID-5 volume is degraded for the duration of the recovery operation, and any failure in the stripes involved in the move makes the volume unusable. The RAID-5 volume can also become invalid if its parity becomes stale. To avoid this, vxsd does not allow a subdisk move in the following situations:

- A stale subdisk occupies any of the same stripes as the subdisk being moved.
- The RAID-5 volume is stopped but was not shut down cleanly; that is, the parity is considered stale.
- The RAID-5 volume is active and has no valid log areas.
Only the third case can be overridden by using the -o force option.

Subdisks of RAID-5 volumes can also be split and joined by using the vxsd split command and the vxsd join command. These operations work the same way as those for mirrored volumes.

Note: RAID-5 subdisk moves are performed in the same way as subdisk moves for other volume types, but without the penalty of degraded redundancy.

Starting RAID-5 Volumes

When a RAID-5 volume is started, it can be in one of many states. After a normal system shutdown, the volume should be clean and require no recovery. However, if the volume was not closed, or was not unmounted before a crash, it can require recovery when it is started, before it can be made available. This section describes the actions that can be taken under certain conditions. Under normal conditions, volumes are started automatically after a reboot, and any recovery takes place automatically or is done through the vxrecover command.

Unstartable RAID-5 Volumes

A RAID-5 volume is unusable if some part of the RAID-5 plex does not map the volume length:

- The RAID-5 plex is sparse in relation to the RAID-5 volume length.
- The RAID-5 plex does not map a region where two subdisks have failed within a stripe, either because they are stale or because they are built on a failed disk.
When this occurs, the vxvol start command returns the following error message:

VxVM vxvol ERROR V-5-1-1236 Volume r5vol is not startable; RAID-5 plex does not map entire volume length.

At this point, the contents of the RAID-5 volume are unusable.

Another way that a RAID-5 volume can become unstartable is if the parity is stale and a subdisk becomes detached or stale. This occurs because, within the stripes that contain the failed subdisk, the parity stripe unit is invalid (because the parity is stale) and the stripe unit on the bad subdisk is also invalid. The figure Invalid RAID-5 Volume illustrates a RAID-5 volume that has become invalid due to stale parity and a failed subdisk.

[Figure: Invalid RAID-5 Volume]

This example shows four stripes in the RAID-5 array. All parity is stale, and subdisk disk05-00 has failed. Stripes X and Y are unusable because two failures have occurred within each of those stripes, which prevents the use of the volume. In this case, the vxvol start command fails with the following error message:

VxVM vxvol ERROR V-5-1-1237 Volume r5vol is not startable; some subdisks are unusable and the parity is stale.

This situation can be avoided by always using two or more RAID-5 log plexes in RAID-5 volumes, because RAID-5 log plexes prevent the parity within the volume from becoming stale (see System Failures for details).

Forcibly Starting RAID-5 Volumes

You can start a volume even if subdisks are marked as stale: for example, if a stopped volume has stale parity and no RAID-5 logs, and a disk becomes detached and then reattached. The subdisk is considered stale even though the data is not out of date (because the volume was in use when the subdisk was unavailable), and the RAID-5 volume is considered invalid. To prevent this case, always have multiple valid RAID-5 logs associated with the array whenever possible.

To start a RAID-5 volume with stale subdisks, you can use the -f option with the vxvol start command. This causes all stale subdisks to be marked as non-stale. Marking takes place before the start operation evaluates the validity of the RAID-5 volume and what is needed to start it. You can also mark individual subdisks as non-stale by using the following command:

# vxmend [-g diskgroup] fix unstale subdisk
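As a sketch of both approaches, reusing the disk group mydg and volume r5vol from the examples in this chapter, and borrowing the stale subdisk name disk02-01 from the degraded-mode display earlier (substitute the actual subdisk shown by vxprint -ht):

To force the volume to start, marking all of its stale subdisks as non-stale:

# vxvol -g mydg -f start r5vol

Or, to clear the stale flag on a single subdisk before starting the volume normally:

# vxmend -g mydg fix unstale disk02-01
# vxvol -g mydg start r5vol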
If some subdisks are stale and need recovery, and if valid logs exist, the volume is enabled by placing it in the ENABLED kernel state, and the volume is available for use during the recovery process. Otherwise, the volume kernel state is set to DETACHED, and the volume is not available while it is being recovered. This is done because if the system were to crash, or the volume were ungracefully stopped while it was active, the parity would become stale, making the volume unusable. If this is undesirable, the volume can be started with the -o unsafe start option.

Caution: The -o unsafe start option is considered dangerous, as it can make the contents of the volume unusable. It is therefore not recommended.

If any subdisk recovery fails and there are no valid logs, the volume start is aborted, because the subdisk remains stale and a system crash would make the RAID-5 volume unusable. This too can be overridden by using the -o unsafe start option, subject to the same caution. If the volume has valid logs, subdisk recovery failures are noted but do not stop the start procedure.
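After any recovery completes, it is worth confirming that the volume is healthy again. A minimal check, reusing the mydg examples above:

# vxinfo -g mydg
# vxprint -g mydg -ht r5vol

vxinfo reports whether each volume is started, and the vxprint output should show the volume state as ACTIVE, with no subdisks flagged dS.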