Failing Back to the Original Primary
After an unexpected failure, a failed Primary host might restart to find that one of its Secondaries was promoted to Primary by a takeover performed during the outage. The process of transferring the Primary role back to the original Primary is called failback.
VVR provides the following methods to fail back to the original Primary:
- Fast failback synchronization
- Difference-based synchronization
Fast Failback Versus Difference-Based Synchronization
In the case of fast failback, the data blocks that changed while the original Primary was unavailable are tracked using the DCM for each volume. Difference-based synchronization computes MD5 checksums for fixed-size data blocks on the Primary and Secondary data volumes, compares them, and then determines whether each data block needs to be transferred from the Primary data volume to the Secondary data volume. Fast failback is recommended over difference-based synchronization for the following reasons:
- For difference-based synchronization, all the blocks on all the Primary and Secondary data volumes are read; in the case of fast failback, only the blocks that changed on the new Primary are read and hence the number of read operations required is smaller.
- For difference-based synchronization, the differences are determined by computing and comparing the checksum of each data block on the Secondary and the Primary; in the case of fast failback, there is no need to compute checksums because the differences are tracked as they happen, which makes fast failback faster.
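The read cost of the checksum approach can be seen in a toy model. The sketch below is illustrative only and is not how VVR is implemented: two small files stand in for a Primary and a Secondary data volume, the 4-byte block size is arbitrary, the helper name changed_blocks is hypothetical, and md5sum and dd are assumed to be available (as on a typical Linux host).

```shell
#!/bin/sh
# Toy model of difference-based synchronization. NOT VVR code: two plain
# files stand in for the Primary and Secondary data volumes.

# changed_blocks SRC DST BLOCKSIZE  (hypothetical helper)
# Prints the indexes of the blocks whose MD5 checksums differ. Every block
# of both files is read, which is the cost that fast failback avoids.
changed_blocks() {
    src=$1
    dst=$2
    bs=$3
    size=$(wc -c < "$src")
    nblocks=$(( (size + bs - 1) / bs ))
    i=0
    result=""
    while [ "$i" -lt "$nblocks" ]; do
        a=$(dd if="$src" bs="$bs" skip="$i" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
        b=$(dd if="$dst" bs="$bs" skip="$i" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
        [ "$a" = "$b" ] || result="$result $i"
        i=$((i + 1))
    done
    # Unquoted on purpose: drops the leading space.
    echo $result
}

tmpdir=$(mktemp -d)
printf 'AAAABBBBCCCCDDDD' > "$tmpdir/primary"
printf 'AAAAxxxxCCCCyyyy' > "$tmpdir/secondary"

diff_result=$(changed_blocks "$tmpdir/primary" "$tmpdir/secondary" 4)
echo "blocks that differ: $diff_result"
rm -rf "$tmpdir"
```

With this sample data the script reports blocks 1 and 3, but only after reading and hashing all four blocks on both sides. A change map that recorded the two writes as they happened would identify the same blocks without reading anything.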
The following sections describe each of the above methods for failing back to the original Primary.
Failing Back Using Fast Failback Synchronization
We recommend that you use the fast failback synchronization method. This procedure assumes that the fast failback feature was enabled on the new Primary when takeover was performed. Failing back to the original Primary using fast failback involves the following steps:
-
Converting the original Primary to an acting Secondary, as shown in the vxprint -l rvgname output, and replaying the DCM or SRL of the original Primary to set bits in the DCM of the new Primary. This is performed automatically when the Primary recovers, unless fast failback was disabled during the takeover.
It is possible that the Primary and Secondary data volumes are not up-to-date because all updates to the original Primary might not have reached the Secondary before the takeover. The failback process takes care of these writes by replaying the SRL or DCM of the original Primary. After the original Primary detects that a takeover has occurred, the new Primary uses information in the DCM or SRL of the original Primary to set bits in its DCM for any blocks that changed on the original Primary before the takeover. You can use the vxrlink status command to monitor the progress of the DCM replay.
-
Converting the original Primary to Secondary and synchronizing the data volumes on the original Primary with the data volumes on the new Primary using the vradmin fbsync command. This command replays the failback log to synchronize the data volumes. The blocks that changed on the original Primary are resynchronized with the new Primary after the DCM of the new Primary is replayed. During the resynchronization, the data from the data volumes on the new Primary is transferred to the data volumes on the original Primary.
This step is not required if the -autofb option was used at the time of the takeover. The data on the original Primary data volumes is inconsistent for the duration of the replay. To keep a consistent copy of the original Primary data, take a snapshot of the data volumes before starting the replay. When using the vradmin fbsync command, you can also specify the cache or the cachesize option so that a space-optimized snapshot of the original Primary data volumes is automatically created. If the RVG on the original Primary has VxVM ISP volumes, then you cannot use the cachesize attribute.
-
Migrating the Primary role back to the original Primary and starting replication.
In the following illustration, the original Primary seattle has recovered and is now the acting Secondary. The new Primary london uses information in the DCM or SRL of the original Primary to set bits in its DCM for any blocks that changed on the original Primary before the takeover.
In the following illustration, the fast failback feature was enabled on the new Primary london when takeover was performed.
The original Primary seattle is being resynchronized using the failback log.
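As a rough mental model of the two stages described above, a DCM can be pictured as a string of bits, one per region of a volume. The sketch below is not VVR code: the bit strings, the eight-region volume, and the helper name or_bitmaps are all made up for illustration. It merges the regions written on the new Primary after the takeover with the regions the original Primary had changed but not yet replicated. The result is the set of regions whose data would have to be copied from the new Primary back to the original Primary.

```shell
#!/bin/sh
# Toy model of merging change maps. NOT VVR code. Each character is one
# region of a volume: 1 = changed, 0 = unchanged.

# or_bitmaps A B  (hypothetical helper)
# Prints the bitwise OR of two equal-length bit strings.
or_bitmaps() {
    a=$1
    b=$2
    out=""
    while [ -n "$a" ]; do
        x=${a%"${a#?}"}   # first character of a
        y=${b%"${b#?}"}   # first character of b
        if [ "$x" = 1 ] || [ "$y" = 1 ]; then
            out="${out}1"
        else
            out="${out}0"
        fi
        a=${a#?}          # drop the first character
        b=${b#?}
    done
    echo "$out"
}

# Regions written on the new Primary (london) after the takeover.
new_primary_dcm=00110000
# Regions changed on the original Primary (seattle) before the takeover
# that never reached the Secondary.
orig_primary_pending=10000010

merged=$(or_bitmaps "$new_primary_dcm" "$orig_primary_pending")
echo "regions to resynchronize: $merged"
```

Both kinds of region end up in the merged map because, after a takeover, the new Primary holds the authoritative data: unreplicated writes on the original Primary are overwritten along with the regions that changed on the new Primary.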
Example 1---Failing Back to the Original Primary Using Fast Failback
In this example, the Primary host seattle has restarted after an unexpected failure. After the failure, the original Primary seattle was taken over by the Secondary host london. Each data volume on the Secondary london has a Data Change Map (DCM) associated with it. As a result, fast failback is enabled on london.
An application is running on london and incoming writes are being logged to its DCM. This example shows how to fail back to the original Primary seattle using the fast failback feature.
To fail back to the original Primary seattle using fast failback
-
Examine the original Primary and make sure you want to convert the original Primary to Secondary.
-
Convert the original Primary to Secondary and synchronize the data volumes in the original Primary RVG hr_rvg with the data volumes on the new Primary RVG hr_rvg on london using the fast failback feature. To synchronize the Secondary using fast failback, type the following command on the new Primary london or the original Primary seattle:
# vradmin -g hrdg [-wait] fbsync hr_rvg \
[cache=cacheobj | cachesize=size]
When the synchronization completes, go to the next step. You can check the status of the synchronization using the vxrlink status command. The -wait option with the vradmin fbsync command can also be used to wait for the completion of the synchronization process.
The cache attribute specifies the name of a precreated cache object on which the snapshots for the volumes in the specified RVG will be created. For more information on creating the cache object, refer to the section Preparing the RVG Volumes for Snapshot Operation. The cachesize attribute specifies a default size for the cache object with respect to the source volume. You can specify only one of these attributes at a time with the vradmin fbsync command to create one cache object for each snapshot.
The cache and cachesize parameters are optional. If you specify neither, the vradmin fbsync command converts the original Primary to a Secondary and synchronizes the data volumes on the original Primary with the data volumes on the new Primary without creating snapshots.
This step is not required if the -autofb option was used at the time of takeover.
-
At a convenient time, stop the application on the new Primary.
-
Migrate the Primary role from the new Primary host london to the original Primary host seattle by typing the following command on any host in the RDS:
# vradmin -g hrdg migrate hr_rvg seattle
Replication from the original Primary seattle to the original Secondary london is started by default.
-
Restart the application on the original Primary seattle. Because the application was stopped properly before the migration, an application recovery is not required.
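The command steps of this example can be strung together in a small wrapper. The following is a minimal sketch under stated assumptions, not a supported tool: the DRY_RUN variable, the run helper, and the stop_application hook are conventions of this sketch, and it assumes that vradmin exits with a non-zero status on failure, which this section does not state. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the Example 1 command sequence, intended to run on the new
# Primary (london). Defaults to a dry run that only prints each command;
# set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
DG=hrdg
RVG=hr_rvg
ORIG_PRIMARY=seattle

# Hypothetical hook: replace the body with whatever stops your application.
stop_application() { :; }

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

failback_fast() {
    # Step 2: resynchronize the original Primary; -wait blocks until the
    # synchronization completes. Not needed if -autofb was used at takeover.
    run vradmin -g "$DG" -wait fbsync "$RVG" || return 1
    # Step 3: stop the application on the new Primary.
    run stop_application || return 1
    # Step 4: migrate the Primary role back to the original Primary.
    run vradmin -g "$DG" migrate "$RVG" "$ORIG_PRIMARY" || return 1
}

plan=$(failback_fast)
echo "$plan"
```

Steps 1 and 5 (examining the original Primary and restarting the application on seattle) are left as manual steps. Each step runs only if the previous one succeeded, so a failed fbsync never leads to a migrate.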
Example 2---Failing Back to the Original Primary Using Fast Failback in a Multiple Secondaries Setup
In this example, the setup consists of two Secondaries, london and tokyo. The Primary host seattle has restarted after an unexpected failure. After the failure, the original Primary seattle was taken over by the Secondary host london. Each data volume on the Secondary london has a Data Change Map (DCM) associated with it. As a result, fast failback is enabled on london.
If you had created RLINKs between the hosts london and tokyo when setting up the replication, you do not need to manually reconfigure the additional Secondary tokyo as a part of the RDS. It will automatically be added as a Secondary of the new Primary london.
An application is running on london and incoming writes are being logged to its DCM. This example shows how to fail back to the original Primary seattle using the fast failback feature.
-
Perform step 1 through step 4 of Example 1---Failing Back to the Original Primary Using Fast Failback.
-
After migration, you must synchronize the additional Secondary tokyo with the original Primary seattle.
On the original Primary seattle:
- Synchronize the data volumes in the Secondary RVG hr_rvg on tokyo with the data volumes in the original Primary RVG hr_rvg using the difference-based synchronization and checkpoint. To do this, use the following command on any host in the RDS:
# vradmin -g hrdg -c checkpt syncrvg hr_rvg tokyo
When used with the vradmin syncrvg command, the -c option automatically starts a checkpoint with the specified name (checkpt in this example). After the data volumes are synchronized, the checkpoint is ended.
- Start replication to tokyo using the checkpoint created above:
# vradmin -g hrdg -c checkpt startrep hr_rvg tokyo
-
Restart the application on the original Primary seattle. Because the application was stopped properly before the migration, an application recovery is not required.
Failing Back Using Difference-Based Synchronization
Failing back to the original Primary using difference-based synchronization involves the following steps:
-
Converting the original Primary to a Secondary of the new Primary.
-
Synchronizing the data volumes on the original Primary with the data volumes on the new Primary using difference-based synchronization with a checkpoint.
-
Starting replication to the Secondary (original Primary) using the checkpoint.
-
Migrating the Primary role to the original Primary, which starts replication by default.
The examples in the following section explain how to fail back to the original Primary using VVR.
Converting an Original Primary to a Secondary
VVR provides the vradmin makesec command to convert the original Primary to a Secondary. This command is only needed if fast failback was not enabled when the original takeover was performed. If fast failback was enabled, the original Primary will automatically be converted to a Secondary when DCM playback begins. Note that this command can be run only from the original Primary host where one of its original Secondaries has taken over the Primary role.
Run the vradmin makesec command when the original Primary restarts. If the application restarts automatically when the Primary restarts, stop it first.
Tip
Before using the vradmin makesec command, make sure you close all the applications running on the original Primary's data volumes. Also ensure that none of the data volumes are open.
The vradmin makesec command fails if the Secondary data volumes are not up-to-date or if there are some applications still running on the failed Primary's data volumes. Use the -f option to convert a failed Primary to a Secondary even when the Secondary data volumes are not up-to-date. If any of the failed Primary data volumes are open or have applications running on them, then using the vradmin makesec command with the -f option will fail. To proceed with the vradmin makesec command, first close the volumes or stop the applications as required.
To convert an original Primary to a Secondary:
# vradmin -g diskgroup makesec local_rvgname newprimary_name
The argument diskgroup is the disk group on the local host.
The argument local_rvgname is the name of the RVG on the local host, that is, the original Primary; it represents the RDS.
The argument newprimary_name is the name of the new Primary host, that is, the previous Secondary host. Note that the newprimary_name argument must be the same as the host name displayed with the Primary-Primary configuration error in the output of the vradmin -l printrvg command.
Example 3---Failing Back to the Original Primary Using Difference-Based Synchronization
In this example, the Primary host seattle has restarted after an unexpected failure. After the failure, the original Primary seattle has been manually taken over by the Secondary host london. This example shows how to fail back to the original Primary seattle using difference-based synchronization. For more information, see Synchronizing Volumes Using Difference-Based Synchronization.
To fail back to the original Primary seattle
-
Make the original Primary RVG hr_rvg on seattle the Secondary RVG of the new Primary london by typing the following command on the original Primary seattle:
# vradmin -g hrdg makesec hr_rvg london
-
Synchronize the data volumes in the original Primary RVG hr_rvg with the data volumes in the new Primary RVG hr_rvg on london using the difference-based synchronization and checkpoint. To synchronize the Secondary based on differences using a checkpoint, type the following command on any host in the RDS:
# vradmin -g hrdg -c checkpt_presync syncrvg hr_rvg seattle
-
Stop the application on the new Primary london.
-
Start replication to the Secondary RVG (original Primary) hr_rvg on seattle from the new Primary RVG hr_rvg on london using the checkpoint by typing the following command on any host in the RDS:
# vradmin -g hrdg -c checkpt_presync startrep hr_rvg seattle
-
Migrate the Primary role from the new Primary host london to the original Primary host seattle by typing the following command on any host in the RDS:
# vradmin -g hrdg migrate hr_rvg seattle
Replication from the original Primary seattle to the original Secondary london is started by default.
-
Restart the application on the original Primary seattle. Because the application was stopped properly before the migration, an application recovery is not required.
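The ordering of the commands in this example matters, so it can help to see them as one sequence. The following is a minimal sketch under stated assumptions, not a supported tool: the DRY_RUN variable, the run helper, and the stop_application_on_new_primary hook are conventions of this sketch, and it assumes that vradmin exits with a non-zero status on failure, which this section does not state. It is meant to run on the original Primary seattle, because vradmin makesec can be run only there. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the Example 3 command sequence, intended to run on the original
# Primary (seattle). Defaults to a dry run that only prints each command;
# set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
DG=hrdg
RVG=hr_rvg
ORIG_PRIMARY=seattle
NEW_PRIMARY=london
CHECKPOINT=checkpt_presync

# Hypothetical hook: replace the body with whatever stops your application
# on the new Primary, for example a remote command of your choosing.
stop_application_on_new_primary() { :; }

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

failback_diff() {
    # Step 1: convert the original Primary to a Secondary of the new Primary.
    run vradmin -g "$DG" makesec "$RVG" "$NEW_PRIMARY" || return 1
    # Step 2: difference-based synchronization with a checkpoint.
    run vradmin -g "$DG" -c "$CHECKPOINT" syncrvg "$RVG" "$ORIG_PRIMARY" || return 1
    # Step 3: stop the application on the new Primary.
    run stop_application_on_new_primary || return 1
    # Step 4: start replication to the original Primary using the checkpoint.
    run vradmin -g "$DG" -c "$CHECKPOINT" startrep "$RVG" "$ORIG_PRIMARY" || return 1
    # Step 5: migrate the Primary role back to the original Primary.
    run vradmin -g "$DG" migrate "$RVG" "$ORIG_PRIMARY" || return 1
}

plan=$(failback_diff)
echo "$plan"
```

Restarting the application on seattle (step 6) is left as a manual step. The same checkpoint name is passed to both syncrvg and startrep, matching the commands in the example.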
Example 4---Failing Back to the Original Primary Using Difference-Based Synchronization in a Multiple Secondaries Setup
This example shows how to fail back to the original Primary seattle using the difference-based synchronization feature.
-
Perform step 1 through step 5 of Example 3---Failing Back to the Original Primary Using Difference-Based Synchronization.
-
On the original Primary seattle:
- Synchronize the data volumes in the Secondary RVG hr_rvg on tokyo with the data volumes in the original Primary RVG hr_rvg using the difference-based synchronization and checkpoint. To do this, use the following command on any host in the RDS:
# vradmin -g hrdg -c checkpt syncrvg hr_rvg tokyo
When used with the vradmin syncrvg command, the -c option automatically starts a checkpoint with the specified name (checkpt in this example). After the data volumes are synchronized, the checkpoint is ended.
- Start replication from seattle to tokyo using the command:
# vradmin -g hrdg -c checkpt startrep hr_rvg tokyo
-
Restart the application on the original Primary seattle. Because the application was stopped properly before the migration, an application recovery is not required.