Product: Volume Replicator Guides
Manual: Volume Replicator 4.1 Administrator's Guide

Recovery

Primary-Host Crash

When a Primary host recovers from a failure, VVR automatically recovers the RVG configuration, including the Primary SRL and all volumes in the RVG. Information about the recent activity on the SRL and the data volume is maintained in the SRL header. VVR uses this information to speed up recovery, which is automatic on reboot.

Recovering from Primary Data Volume Error

If an error occurs while accessing a Primary data volume, the data volume is detached, the RVG is disabled, and the RVG state changes to FAIL, which means that the RVG is unusable. RLINKs are not affected. If the SRL was not empty at the time of the volume error, those updates continue to flow from the SRL to the Secondary RLINKs. Recovery from this failure consists of two parts:

If the RVG contains a database, recovery of the failed data volume must be coordinated with the recovery requirements of the database. The details of the database recovery sequence determine what must be done to synchronize Secondary RLINKs.

Example 1

In this example, all the RLINKs are detached before recovery of the failure begins on the Primary. When recovery of the failure is complete, including any database recovery procedures, all the RLINKs must be synchronized using a Primary checkpoint.

Example 2

This example does the minimum to clear the RVG FAIL state, leaving all RLINKs attached. The failed volume data is restored from backup and the database recovery is performed while the RLINKs are live. Because all the changes on the Primary are replicated, all the Secondaries must be consistent with the Primary after the changes have been replicated. This method may not always be practical because it might require replication of large amounts of data. The repaired data volume must also be carefully tested on every target database to be supported.

As an alternative to the procedures described in Example 1 and Example 2, the Primary role can be transferred to a Secondary host. For details, see Chapter 7, Transferring the Primary Role.

Primary SRL Volume Error Cleanup and Restart

If there is an error accessing the Primary SRL, the SRL is dissociated and the RLINKs are detached. The state of the Primary and Secondary RLINKs is changed to STALE. The RVG state does not change, but the RVG is put into PASSTHRU mode, which allows updates to the Primary volume to continue until the error is fixed. See RVG PASSTHRU Mode.

The SRL must be repaired manually and then associated with the RVG. While the SRL is being repaired, no attempt is made to send data to the RLINKs. After the SRL is replaced, all RLINKs must be completely synchronized. Attach the RLINKs and perform a complete synchronization of the Secondaries.

To clean up after a Primary SRL error

RVG PASSTHRU Mode

Typically, writes to data volumes associated with an RVG go to the RVG's SRL first, and then to the RLINKs and data volumes. If the Primary SRL is ever detached because of an access error, the Primary RVG is put into PASSTHRU mode. In PASSTHRU mode, writes to the data volume are passed directly to the underlying data volume, bypassing the SRL. No RLINKs receive the writes. Use vxprint -l on the RVG to see if the passthru flag is set. Associating a new SRL clears PASSTHRU mode, after which the Secondary node RVGs must be synchronized.

Primary SRL Volume Error at Reboot

If the Primary SRL has an error during reboot, the disks or arrays containing the SRL may not yet have come online. Because of this, instead of placing the RVG in PASSTHRU mode, VVR does not recover the RVG. When the SRL becomes available, issue the following commands to recover the RVG and the RLINK:

    # vxrvg -g diskgroup recover rvg_name
    # vxrlink -g diskgroup recover rlink_name

After this error has occurred and you have successfully recovered the RVG, if you dissociate a volume from the RVG, you may see the following message:

    Because there could be outstanding writes in the SRL, the data volume being dissociated should be considered out-of-date and inconsistent

If the SRL is permanently lost, create a new SRL as described in Recovering from SRL Header Error. In this case, writes that had succeeded on the old SRL and been acknowledged to the application may not yet have been flushed to the data volumes and are now lost. Consequently, you must restore the data volumes from backup before proceeding. Because this causes the data volumes to be completely rewritten, it is recommended that you detach the RLINKs and synchronize them after the restore operation is complete.
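
The two recover commands given above for the case where the SRL becomes available after a reboot can be wrapped in a small helper when an RVG has several RLINKs. The following is a minimal sketch, not a supported procedure: the function name recover_after_reboot, the run helper, and the DRY_RUN variable are hypothetical. With DRY_RUN left at its default of 1, the commands are only printed so that they can be reviewed first.

```shell
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# recover_after_reboot DISKGROUP RVG RLINK...
# Recovers the RVG first, then each RLINK named on the command line.
recover_after_reboot() {
    dg=$1
    rvg=$2
    shift 2
    run vxrvg -g "$dg" recover "$rvg" || return 1
    for rlink in "$@"; do
        run vxrlink -g "$dg" recover "$rlink" || return 1
    done
}
```

For example, recover_after_reboot hrdg hr_rvg rlk_london_hr_rvg prints one vxrvg recover line followed by one vxrlink recover line. The disk group and RLINK names are taken from the examples in this chapter; hr_rvg is an assumed RVG name.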

Primary SRL Volume Overflow Recovery

Because the size of the Primary SRL is finite, prolonged halts in update activity to any RLINK can exceed the log's ability to maintain all the necessary update history to bring an RLINK up-to-date. When this occurs, the RLINK in question is marked as STALE and requires manual recovery before replication can proceed. A STALE RLINK can only be brought up-to-date by using automatic synchronization or a block-level backup and checkpoint. The other RLINKs, the RVG, and the SRL volume are all still operational.

SRL overflow protection can be set up to prevent SRL overflow, and is the default. Instead of allowing the RLINK to become STALE, DCM logging is initiated. At a later time when the communication link is not overloaded, you can incrementally resynchronize the RLINK using the vradmin resync rvg command.

Primary SRL Header Error Cleanup and Recovery

An SRL header failure on the Primary is a serious error. All RLINKs are lost and must be recovered using a Primary checkpoint. Because information about data volume errors is kept in the SRL header, the correct status of data volumes cannot be guaranteed under all occurrences of this error. For this reason, we recommend that the SRL be mirrored.

If an SRL header error occurs during normal operation and you notice it before a reboot occurs, you can be certain that any data volumes that have also (simultaneously) failed will have a status of DETACHED. If the system is rebooted before the vxprint command shows the volumes to be in the DETACHED state, the status of any failed data volumes may be lost. Both these cases involve multiple errors and are unlikely, but it is important to understand that the state of Primary data volumes can be suspect with this type of error.

When a Primary SRL header error occurs, writes to the RVG continue; however, all RLINKs are put in the STALE state. The RVG is operating in PASSTHRU mode.
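
To confirm that an RVG is operating in PASSTHRU mode, the vxprint -l output for the RVG can be checked for the passthru flag, as noted under RVG PASSTHRU Mode. The sketch below is illustrative only: the function name rvg_in_passthru is hypothetical, and the check assumes nothing about the layout of the vxprint output beyond the word passthru appearing somewhere in it.

```shell
# rvg_in_passthru reads "vxprint -l" output for one RVG on standard input
# and returns exit status 0 if the word "passthru" appears anywhere in it.
# Assumption: the flag is printed as the literal text "passthru"; no other
# detail of the vxprint output format is relied upon.
rvg_in_passthru() {
    grep -qi 'passthru'
}
```

A possible use, assuming the disk group hrdg and an RVG named hr_rvg, is: vxprint -g hrdg -l hr_rvg | rvg_in_passthru && echo "RVG is in PASSTHRU mode". Because the check is a plain text match, the full vxprint -l output should still be read before acting on the result.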

Recovering from SRL Header Error

To recover from an SRL header error, dissociate the SRL from the RVG, repair the SRL, and completely synchronize all the RLINKs.
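
The dissociate and reassociate steps around the repair might be sketched as follows. This is an assumption-laden outline rather than the documented procedure: the vxrvg stop and start steps and the vxvol dis, start, and aslog commands are drawn from general VxVM and VVR usage and are not quoted in this section, and the function names, the run helper, and DRY_RUN are hypothetical. By default the commands are only printed.

```shell
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# srl_dissociate DISKGROUP RVG SRL
# Before the repair: stop the RVG, then dissociate the damaged SRL.
# Assumption: the RVG is stopped before the SRL is dissociated.
srl_dissociate() {
    dg=$1
    rvg=$2
    srl=$3
    run vxrvg -g "$dg" stop "$rvg" || return 1
    run vxvol -g "$dg" dis "$srl"
}

# srl_reassociate DISKGROUP RVG SRL
# After the repair: start the SRL volume, associate it with the RVG as
# its log, and restart the RVG.
srl_reassociate() {
    dg=$1
    rvg=$2
    srl=$3
    run vxvol -g "$dg" start "$srl" || return 1
    run vxvol -g "$dg" aslog "$rvg" "$srl" || return 1
    run vxrvg -g "$dg" start "$rvg"
}
```

The SRL repair itself happens between the two calls, for example srl_dissociate hrdg hr_rvg hr_srl followed later by srl_reassociate hrdg hr_rvg hr_srl, where hr_rvg and hr_srl are assumed names. Synchronizing the RLINKs afterwards is a separate step and is not covered by this sketch.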

Secondary Data Volume Error Cleanup and Recovery

If an I/O error occurs during access of a Secondary data volume, the data volume is automatically detached from the RVG and the RLINKs are disconnected. A subsequent attempt by the Primary to connect to the Secondary fails and a message that the Secondary volumes are stopped is displayed. The Primary is unaffected and writes continue to be logged into the SRL. After the Secondary data volume error is fixed and the data volume is started, the RLINKs automatically reconnect.

If there is no suitable Primary or Secondary checkpoint, detach the RLINKs on both the Primary and Secondary, and then synchronize the RLINKs. See Restoring the Secondary from Online Backup for details.

Recovery Using a Secondary Checkpoint

This section explains how to recover from a Secondary data volume error using a Secondary checkpoint.
Cleanup Using a Primary Checkpoint

Detach the RLINK and then reattach it from the Primary checkpoint using the following commands:

    # vxrlink -g hrdg det rlk_london_hr_rvg
    # vxrlink -g hrdg -c primary_checkpoint att rlk_london_hr_rvg

Secondary SRL Volume Error Cleanup and Recovery

The Secondary SRL is used only during atomic recovery of an RLINK and when an IBC is active. If I/O errors occur during recovery of the Secondary SRL, the recovery fails, the SRL volume is automatically detached, and the RLINK is forced to the PAUSE state. Manual intervention is required to repair the physical problem, reattach the SRL, and resume the RLINK. Upon resumption, an automatic recovery of the RVG is retried and if it succeeds, update activity can continue. The only problem occurs if the Primary SRL overflows before the repair is complete, in which case a full synchronization is required.

If an error occurs in the data portion of the SRL, the RLINK is forced to the PAUSE state with the secondary_paused flag set. The SRL is not dissociated.

If an error occurs in the SRL header, the Secondary RVG is forced to the FAIL state and the SRL is dissociated.

Secondary SRL Header Error Cleanup and Recovery

An SRL header failure on the Secondary puts the Secondary RVG into the FAIL state and sets the RLINK state to PAUSE on both the Primary and Secondary. Because information about data volume errors is kept in the SRL header, the correct state of data volumes is not guaranteed in all cases. If a Secondary SRL header failure occurs during normal operation and is noticed before a reboot occurs, any data volumes that also failed will have a state of DETACHED. If the system is rebooted before the vxprint command shows the volumes to be in the DETACHED state, the status of any failed data volumes may be lost. Both these cases involve multiple errors and are unlikely, but it is important to understand that the state of Secondary data volumes can be suspect with this type of error.

Secondary SRL Header Error at Reboot

If the Secondary SRL has an error after a reboot, it is not possible to recover it, even if the SRL subsequently becomes available. Ignore the following message:

    VxVM VVR vxrvg ERROR V-5-1-0 RVG rvg_name cannot be recovered because SRL is not accessible. Try recovering the RVG after the SRL becomes available using vxrecover -s command

If replication was frozen due to receipt of an IBC, the data in the SRL is lost but there is no indication of this problem. To see whether this was the case, examine the /var/adm/syslog/syslog.log file for a message such as:

    WARNING: VxVM VVR vxio V-5-0-0 Replication frozen for rlink <rlink>

If this is the last message for the RLINK, that is, if there is no subsequent message stating that replication was unfrozen, the Primary RLINK must be completely resynchronized.
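
The check for a trailing freeze message can be scripted rather than done by eye. The following is a minimal sketch under stated assumptions: the function name last_freeze_state is hypothetical, and because this section does not quote the unfreeze message, the sketch assumes that it contains the words "Replication unfrozen" together with the RLINK name.

```shell
# last_freeze_state LOGFILE RLINK
# Prints "frozen" if the most recent freeze or unfreeze message that
# mentions RLINK in LOGFILE is a "Replication frozen" message, "unfrozen"
# if the most recent one is an unfreeze message, and "none" if neither
# kind of message is found.
# Assumption: the unfreeze message contains "Replication unfrozen" and the
# RLINK name; its exact wording is not given in this section.
# Note: the RLINK name is matched as a plain substring, so a name that is
# a prefix of another RLINK name also matches that other RLINK's messages.
last_freeze_state() {
    logfile=$1
    rlink=$2
    last=$(grep -F "$rlink" "$logfile" \
        | grep -E 'Replication (frozen|unfrozen)' \
        | tail -n 1)
    case $last in
        '')         echo none ;;
        *unfrozen*) echo unfrozen ;;
        *)          echo frozen ;;
    esac
}
```

For example, last_freeze_state /var/adm/syslog/syslog.log rlk_london_hr_rvg would be run on the Secondary host. A result of frozen corresponds to the case described above in which the Primary RLINK must be completely resynchronized; rotated log files are not searched by this sketch.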
VERITAS Software Corporation
www.veritas.com