Recovery

Primary-Host Crash

When a Primary host recovers from a failure, VVR automatically recovers the RVG configuration, including the Primary SRL and all volumes in the RVG. Information about recent activity on the SRL and the data volumes is maintained in the SRL header. VVR uses this information to speed up recovery, which is automatic on reboot.

Recovering from Primary Data Volume Error

If an error occurs while accessing a Primary data volume, the data volume is detached, the RVG is disabled, and the RVG state changes to FAIL, which means that the RVG is unusable. RLINKs are not affected. If the SRL was not empty at the time of the volume error, the updates it holds continue to flow from the SRL to the Secondary RLINKs.

Recovery from this failure consists of two parts:

Restoring the Primary data volume from backup
Resynchronizing any Secondary RLINKs

If the RVG contains a database, recovery of the failed data volume must be coordinated with the recovery requirements of the database. The details of the database recovery sequence determine what must be done to synchronize Secondary RLINKs.

Two examples are given below.

Example 1

In this example, all the RLINKs are detached before recovery from the failure begins on the Primary. When recovery is complete, including any database recovery procedures, all the RLINKs must be synchronized using a Primary checkpoint.

On the Primary (seattle):

Detach all RLINKs
# vxrlink -g hrdg det rlk_london_hr_rvg
Fix or repair the data volume.
If the data volume can be repaired by repairing its underlying subdisks, you need not dissociate the data volume from the RVG. If the problem is fixed by dissociating the failed volume and associating a new one in its place, the dissociation and association must be done while the RVG is stopped.
Make sure the data volume is started before restarting the RVG.
# vxvol -g hrdg start hr_dv01
# vxrvg -g hrdg start hr_rvg
Restore the database.
Synchronize all the RLINKs using block-level backup and checkpointing.
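
The last step of Example 1 is not shown as commands in this section. The following is a minimal sketch of the Primary-side sequence, assuming the vxrvg checkstart and checkend operations used by the block-level backup procedure elsewhere in this guide. The function name checkpoint_sync_plan and the checkpoint name are hypothetical. The function only prints the planned commands for review and runs nothing; Secondary-side steps are reduced to a placeholder line. Verify the syntax against your release before using it.

```shell
# Sketch only: print the Primary-side plan for a checkpoint-based
# synchronization. checkpoint_sync_plan is a hypothetical helper, not a
# VVR command. It prints commands for review and does not execute them.
checkpoint_sync_plan() {
    dg=$1      # disk group, for example hrdg
    rvg=$2     # RVG name, for example hr_rvg
    rlink=$3   # Primary RLINK to the Secondary, for example rlk_london_hr_rvg
    ckpt=$4    # checkpoint name chosen by the administrator (assumption)
    cat <<EOF
vxrvg -g $dg -c $ckpt checkstart $rvg
# (take a block-level backup of the Primary data volumes)
vxrvg -g $dg checkend $rvg
# (restore the backup onto the Secondary data volumes)
vxrlink -g $dg -c $ckpt att $rlink
EOF
}
```

For example, checkpoint_sync_plan hrdg hr_rvg rlk_london_hr_rvg checkpt1 prints five lines: three commands and two placeholder steps, in the order they would be carried out.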

Example 2

This example does the minimum needed to clear the RVG FAIL state and leaves all RLINKs attached. Both the restoration of the failed volume's data from backup and the database recovery are done with live RLINKs. Because all the changes on the Primary are replicated, all the Secondaries will be consistent with the Primary once the changes have been replicated. This method may not always be practical because it might require replication of large amounts of data. The repaired data volume must also be carefully tested on every target database to be supported.

On the Primary (seattle):

Stop the RVG.
# vxrvg -g hrdg stop hr_rvg
Dissociate the failed data volume from the RVG.
Fix or repair the data volume or use a new volume.
If the data volume can be repaired by repairing its underlying subdisks, you need not dissociate the data volume from the RVG. If the problem is fixed by dissociating the failed volume and associating a new one in its place, the dissociation and association must be done while the RVG is stopped.
Associate the volume with the RVG.
Make sure the data volume is started before restarting the RVG. If the data volume is not started, start the data volume:
# vxvol -g hrdg start hr_dv01
Start the RVG:
# vxrvg -g hrdg start hr_rvg
Restore the database.
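
The dissociate and associate steps in Example 2 are described without commands. The following is a minimal sketch of how the steps could be scripted, assuming the vxvol dis and vxvol assoc operations apply to a data volume as they do elsewhere in this guide. The function names and the DRY_RUN variable are hypothetical; when DRY_RUN is set to a non-empty value, each command is printed instead of executed.

```shell
# Sketch only. prepare_volume_repair, finish_volume_repair and DRY_RUN are
# hypothetical names, not part of VVR.
# ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is non-empty, so the
# command line is printed; otherwise it expands to nothing and the
# command runs normally.

# Before the repair: stop the RVG, then dissociate the failed data volume.
prepare_volume_repair() {
    dg=$1; rvg=$2; vol=$3
    ${DRY_RUN:+echo} vxrvg -g "$dg" stop "$rvg" || return 1
    ${DRY_RUN:+echo} vxvol -g "$dg" dis "$vol"
}

# After the repair: associate the volume, start it, then start the RVG.
finish_volume_repair() {
    dg=$1; rvg=$2; vol=$3
    ${DRY_RUN:+echo} vxvol -g "$dg" assoc "$rvg" "$vol" || return 1
    ${DRY_RUN:+echo} vxvol -g "$dg" start "$vol" || return 1
    ${DRY_RUN:+echo} vxrvg -g "$dg" start "$rvg"
}
```

Running both functions with DRY_RUN=1 and the arguments hrdg hr_rvg hr_dv01 shows the command sequence without touching the configuration. The repair itself, and the database restore that follows, remain manual steps between and after the two calls.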

As an alternative to the procedures described in Example 1 and Example 2, the Primary role can be transferred to a Secondary host. For details, see Chapter 7, Transferring the Primary Role.

Primary SRL Volume Error Cleanup and Restart

If there is an error accessing the Primary SRL, the SRL is dissociated and the RLINKs are detached. The state of the Primary and Secondary RLINKs is changed to STALE. The RVG state does not change, but the RVG is put into PASSTHRU mode, which allows updates to the Primary data volumes to continue until the error is fixed. See RVG PASSTHRU Mode.

The SRL must be repaired manually and then associated with the RVG. While the SRL is being repaired, no attempt is made to send data to the RLINKs. After the SRL is replaced, attach the RLINKs and perform a complete synchronization of the Secondaries.

On the Primary (seattle):

To clean up after a Primary SRL error

Dissociate the SRL from the RVG.
# vxvol -g hrdg dis hr_srl
Fix or replace the SRL volume.
Make sure that the repaired SRL is started before associating it with the RVG. If the repaired SRL is not started, start it:
# vxvol -g hrdg start hr_srl
Associate a new SRL with the RVG. After associating the new SRL, the RVG PASSTHRU mode no longer displays in the output of the command vxprint -lV.
# vxvol -g hrdg aslog hr_rvg hr_srl
Completely synchronize the Secondary. See Synchronizing the Secondary and Starting Replication for details.

RVG PASSTHRU Mode

Typically, writes to data volumes associated with an RVG go to the RVG's SRL first, and then to the RLINKs and data volumes. If the Primary SRL is ever detached because of an access error, then the Primary RVG is put into PASSTHRU mode. In PASSTHRU mode, writes to the data volume are passed directly to the underlying data volume, bypassing the SRL. No RLINKs receive the writes. Use vxprint -l on the RVG to see if the passthru flag is set. Associating a new SRL will clear PASSTHRU mode, and the Secondary node RVGs must be synchronized.
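
One way to script the passthru check is to search the vxprint -l output for the flag. The sketch below assumes that the flag appears as the literal word passthru somewhere in that output, as the paragraph above suggests; the exact layout of the output is not assumed. The name rvg_in_passthru is a hypothetical helper, not a VVR command.

```shell
# Sketch only: succeed (exit status 0) when the text on standard input
# mentions the passthru flag. rvg_in_passthru is a hypothetical helper.
# Assumption: the flag is printed as the word "passthru" (any case).
rvg_in_passthru() {
    grep -qi 'passthru'
}
```

The helper reads from standard input, so it can be fed directly from the command, for example: vxprint -g hrdg -l hr_rvg | rvg_in_passthru && echo "hr_rvg is in PASSTHRU mode". Keeping the parsing separate from the vxprint call also makes it possible to try the check against saved output.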

Primary SRL Volume Error at Reboot

If the Primary SRL has an error during reboot, there is a possibility that the disks or arrays containing the SRL have not yet come online. Because of this, instead of placing the RVG in PASSTHRU mode, VVR does not recover the RVG. When the SRL becomes available, issue the following commands to recover the RVG and the RLINK:

# vxrvg -g diskgroup recover rvg_name
# vxrlink -g diskgroup recover rlink_name

After this error has occurred and you have successfully recovered the RVG, if you dissociate a volume from the RVG, you may see the following message:

Because there could be outstanding writes in the SRL, the data volume being dissociated should be considered out-of-date and inconsistent

You can ignore this message.

If the SRL is permanently lost, create a new SRL as described in Recovering from SRL Header Error. In this case, it is possible that writes that had succeeded on the old SRL and had been acknowledged to the application were not yet flushed to the data volumes and are now lost. Consequently, you must restore the data volumes from backup before proceeding. Because this causes the data volumes to be completely rewritten, it is recommended that you detach the RLINKs and synchronize them after the restore operation is complete.

Primary SRL Volume Overflow Recovery

Because the size of the Primary SRL is finite, prolonged halts in update activity to any RLINK can exceed the log's ability to maintain all the necessary update history to bring an RLINK up-to-date. When this occurs, the RLINK in question is marked as STALE and requires manual recovery before replication can proceed. A STALE RLINK can only be brought up-to-date by using automatic synchronization or a block-level backup and checkpoint. The other RLINKs, the RVG, and the SRL volume are all still operational.

SRL overflow protection can be set up to prevent SRL overflow, and is the default. With overflow protection, VVR initiates DCM logging instead of allowing the RLINK to become STALE. At a later time, when the communication link is not overloaded, you can incrementally resynchronize the RLINK using the vradmin resync rvg command.
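
The resynchronization command is named above without a full command line. The following sketch shows one plausible invocation for the hrdg disk group and hr_rvg RVG used in this chapter; the placement of the -g option is assumed to follow the other commands in this section. The function name resync_rvg and the DRY_RUN variable are hypothetical. When DRY_RUN is non-empty, the command is printed rather than run.

```shell
# Sketch only: start an incremental resynchronization of an RVG.
# resync_rvg and DRY_RUN are hypothetical names, not part of VVR.
# With DRY_RUN set to a non-empty value the command is only printed.
resync_rvg() {
    dg=$1    # disk group, for example hrdg
    rvg=$2   # local RVG name, for example hr_rvg
    ${DRY_RUN:+echo} vradmin -g "$dg" resync "$rvg"
}
```

With DRY_RUN=1, calling resync_rvg hrdg hr_rvg prints the command line so that it can be reviewed before the link is used for resynchronization.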

Primary SRL Header Error Cleanup and Recovery

An SRL header failure on the Primary is a serious error. All RLINKs are lost and must be recovered using a Primary checkpoint. Because information about data volume errors is kept in the SRL header, the correct status of data volumes cannot be guaranteed under all occurrences of this error. For this reason, we recommend that the SRL be mirrored.

If an SRL header error occurs during normal operation and you notice it before a reboot occurs, you can be certain that any data volumes that have also (simultaneously) failed will have a status of DETACHED. If the system is rebooted before the vxprint command shows the volumes to be in the DETACHED state, the status of any failed data volumes may be lost. Both these cases involve multiple errors and are unlikely, but it is important to understand that the state of Primary data volumes can be suspect with this type of error.

When a Primary SRL header error occurs, writes to the RVG continue; however, all RLINKs are put in the STALE state. The RVG is operating in PASSTHRU mode.

Recovering from SRL Header Error

To recover from an SRL header error, dissociate the SRL from the RVG, repair the SRL, and completely synchronize all the RLINKs.

Stop the RVG.
# vxrvg -g hrdg stop hr_rvg
Dissociate the SRL from the RVG.
# vxvol -g hrdg dis hr_srl
Repair or restore the SRL. Even if the problem can be fixed by repairing the underlying subdisks, the SRL must still be dissociated and reassociated to initialize the SRL header.
Make sure the SRL is started, and then reassociate the SRL:
# vxvol -g hrdg start hr_srl
# vxvol -g hrdg aslog hr_rvg hr_srl
Start the RVG:
# vxrvg -g hrdg start hr_rvg
Restore the data volumes from backup if needed. Synchronize all the RLINKs. See Methods to Synchronize the Secondary.

Secondary Data Volume Error Cleanup and Recovery

If an I/O error occurs during access of a Secondary data volume, the data volume is automatically detached from the RVG and the RLINKs are disconnected. A subsequent attempt by the Primary to connect to the Secondary fails and a message that the Secondary volumes are stopped is displayed. The Primary is unaffected and writes continue to be logged into the SRL. After the Secondary data volume error is fixed and the data volume is started, the RLINKs automatically reconnect.

If there is no suitable Primary or Secondary checkpoint, detach the RLINKs on both the Primary and Secondary, and then synchronize the RLINKs. See Restoring the Secondary from Online Backup for details.

Recovery Using a Secondary Checkpoint

This section explains how to recover from a Secondary data volume error using a Secondary checkpoint.

On the Secondary (london):

Repair the failed data volume. You need not dissociate the data volume if the problem can be fixed by repairing the underlying subdisks.
Make sure that the data volume is started:
# vxvol -g hrdg start hr_dv01
Restore data from the Secondary checkpoint backup to all the volumes. If all volumes are restored from backup, the Secondary will remain consistent during the synchronization.
Restore the RLINK by issuing the following command:
# vxrlink -g hrdg -c sec_chkpt restore rlk_seattle_hr_rvg

Cleanup Using a Primary Checkpoint

On the Secondary (london):

Repair the failed data volume as above. Be sure that the data volume is started before proceeding:
# vxvol -g hrdg start hr_dv01
Detach the RLINK to enable writing to the Secondary data volumes:
# vxrlink -g hrdg det rlk_seattle_hr_rvg
Restore data from the Primary checkpoint backup to all data volumes. Unlike restoration from a Secondary checkpoint, the Primary checkpoint data must be loaded onto all Secondary data volumes, not just the failed volume. If a usable Primary checkpoint does not already exist, make a new checkpoint. For details, see Example: Synchronizing the Secondary Using Block-level Backup.
Reattach the RLINK.
# vxrlink -g hrdg att rlk_seattle_hr_rvg

On the Primary (seattle):

Detach the RLINK and then reattach it from the Primary checkpoint using the following commands:

# vxrlink -g hrdg det rlk_london_hr_rvg
# vxrlink -g hrdg -c primary_checkpoint att rlk_london_hr_rvg

Secondary SRL Volume Error Cleanup and Recovery

The Secondary SRL is used only during atomic recovery of an RLINK and when an IBC is active. If I/O errors occur during recovery of the Secondary SRL, the recovery fails, the SRL volume is automatically detached, and the RLINK is forced to the PAUSE state. Manual intervention is required to repair the physical problem, reattach the SRL, and resume the RLINK. Upon resumption, an automatic recovery of the RVG is retried and if it succeeds, update activity can continue. The only problem occurs if the Primary SRL overflows before the repair is complete, in which case a full synchronization is required.

If an error occurs in the data portion of the SRL, the RLINK is forced to the PAUSE state with the secondary_paused flag set. The SRL is not dissociated.

If an error occurs in the SRL header, the Secondary RVG is forced to the FAIL state and the SRL is dissociated.

On the Secondary (london):

Dissociate the SRL, fix it, and then re-associate it. The dissociation and re-association are necessary even if the problem can be fixed by repairing the underlying subdisks because this sequence initializes the SRL header.
# vxvol -g hrdg dis hr_srl
Fix or replace the SRL. Be sure the SRL is started before associating it:
# vxvol -g hrdg start hr_srl
# vxvol -g hrdg aslog hr_rvg hr_srl
Run the RLINK resume operation to clear the secondary_log_err flag.
# vxrlink -g hrdg resume rlk_seattle_hr_rvg

Secondary SRL Header Error Cleanup and Recovery

An SRL header failure on the Secondary puts the Secondary RVG into the FAIL state, and puts the RLINK into the PAUSE state on both the Primary and Secondary. Because information about data volume errors is kept in the SRL header, the correct state of data volumes is not guaranteed in all cases. If a Secondary SRL header failure occurs during normal operation and is noticed before a reboot occurs, any data volumes that also failed will have a state of DETACHED. If the system is rebooted before the vxprint command shows the volumes to be in the DETACHED state, the status of any failed data volumes may be lost. Both these cases involve multiple errors and are unlikely, but it is important to understand that the state of Secondary data volumes can be suspect with this type of error.

Dissociate the SRL volume.
# vxvol -g hrdg dis hr_srl
Repair the SRL volume. Even if the problem can be fixed by repairing the underlying subdisks, the SRL volume must still be dissociated and re-associated to initialize the SRL header.
Start the SRL volume. Then, re-associate it.
# vxvol -g hrdg start hr_srl
# vxvol -g hrdg aslog hr_rvg hr_srl
Start the RVG.
# vxrvg -g hrdg start hr_rvg
If the integrity of the data volumes is not suspect, just resume the RLINK.
# vxrlink -g hrdg resume rlk_seattle_hr_rvg
OR

If the integrity of the data volumes is suspect, and a Secondary checkpoint backup is available, restore from the Secondary checkpoint.
# vxrlink -g hrdg det rlk_seattle_hr_rvg
# vxrlink -g hrdg -f att rlk_seattle_hr_rvg
# vxrlink -g hrdg -w pause rlk_seattle_hr_rvg
Restore the Secondary checkpoint backup data onto the data volumes.
# vxrlink -g hrdg -c secondary_checkpoint restore rlk_seattle_hr_rvg
OR

If the integrity of the data volumes is suspect and no Secondary checkpoint is available, synchronize the Secondary using a block-level backup and Primary checkpoint (see Example: Synchronizing the Secondary Using Block-level Backup) or automatic synchronization.
# vxrlink -g hrdg det rlk_seattle_hr_rvg
On the Secondary, restore the Primary checkpoint backup data to the data volumes.
# vxrlink -g hrdg -f att rlk_seattle_hr_rvg
On the Primary (seattle):
# vxrlink -g hrdg -c primary_checkpoint att rlk_london_hr_rvg

Secondary SRL Header Error at Reboot

If the Secondary SRL has an error after a reboot, it is not possible to recover it, even if the SRL subsequently becomes available. Ignore the following message:

VxVM VVR vxrvg ERROR V-5-1-0 RVG rvg_name cannot be recovered because SRL is not accessible. Try recovering the RVG after the SRL becomes available using vxrecover -s command

Dissociate the SRL:
# vxvol -g hrdg -f dis srl
Ignore the following messages:

VxVM vxvol WARNING V-5-1-0 WARNING: Rvg rvgname has not been recovered because the SRL is not available. The data volumes may be out-of-date and inconsistent
VxVM vxvol WARNING V-5-1-0 The data volumes in the rvg rvgname cannot be recovered because the SRL is being dissociated. Restore the data volumes from backup before starting the applications
Create a new SRL volume, new_srl, and continue as follows:
# vxvol -g hrdg aslog rvg_name new_srl
# vxrlink -g hrdg recover rlink_name
# vxrlink -g hrdg -f att rlink_name
# vxrvg -g hrdg start rvg_name

If replication was frozen due to receipt of an IBC, the data in the SRL is lost but there is no indication of this problem. To see whether this was the case, examine the /var/adm/syslog/syslog.log file for a message such as:

WARNING: VxVM VVR vxio V-5-0-0 Replication frozen for rlink <rlink>

If this is the last message for the RLINK, that is, if there is no subsequent message stating that replication was unfrozen, the Primary RLINK must be completely resynchronized.
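
The log check described above can be scripted. The following is a minimal sketch that reports whether the most recent freeze-related message for an RLINK says that replication is still frozen. The exact wording of the unfreeze message is not given in this section, so the sketch assumes only that it contains the word "unfrozen" together with the RLINK name. The function name rlink_left_frozen is hypothetical.

```shell
# Sketch only: succeed (exit status 0) when the last freeze-related
# message for the given RLINK reports "frozen" with no later "unfrozen".
# rlink_left_frozen is a hypothetical helper, not a VVR command.
# Assumption: the unfreeze message contains the word "unfrozen".
rlink_left_frozen() {
    logfile=$1   # for example /var/adm/syslog/syslog.log
    rlink=$2     # RLINK name as it appears in the log

    # Keep only lines that name this RLINK and mention freezing.
    # "frozen" also matches "unfrozen", so both kinds of message survive.
    last=$(grep -F -- "$rlink" "$logfile" | grep -i 'frozen' | tail -n 1)

    case $last in
        *[Uu]nfrozen*) return 1 ;;   # replication was unfrozen afterwards
        *[Ff]rozen*)   return 0 ;;   # last word on the matter is "frozen"
        *)             return 1 ;;   # no freeze messages for this RLINK
    esac
}
```

For example, rlink_left_frozen /var/adm/syslog/syslog.log rlk_seattle_hr_rvg returns success only when a complete resynchronization is called for under the rule above. The match on the RLINK name is a plain substring match, so an RLINK whose name is a prefix of another may need a stricter pattern.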



Product: Volume Replicator Guides
Manual: Volume Replicator 4.1 Administrator's Guide
VERITAS Software Corporation www.veritas.com