Note: Even if the device errors are considered transient, they may still have caused uncorrectable data errors within the pool. These errors require special repair procedures, even if the underlying device is deemed healthy or has otherwise been repaired. For more information on repairing data errors, see 9.7 Repairing Damaged Data.
9.6.2 Clearing Transient Errors
If the errors seen are deemed transient, in that they are unlikely to affect the future health of the device, then the device errors can be safely cleared to indicate that there was no fatal error. To clear a device of any errors, simply online the device using the zpool online command:
This syntax clears any errors associated with the device. For more information on onlining devices, see 4.5.2.2 Bringing a Device Online.
9.6.3 Replacing a Device
If device damage is permanent, or future permanent damage is likely, the device needs to be replaced. Whether or not the device can be replaced depends on the configuration.
9.6.3.1 Determining if a Device can be Replaced
In order for a device to be replaced, the pool must not be in the FAULTED state, and the device must either be part of a replicated configuration or be healthy (in the ONLINE state). If the disk is part of a replicated configuration, there must be sufficient replicas from which to retrieve good data. If two disks in a four-way mirror are faulted, then either can be replaced, since healthy replicas remain. On the other hand, if two disks in a four-way RAID-Z device are faulted, then neither can be replaced, since there are not enough replicas from which to retrieve data. If the device is damaged but otherwise online, it can be replaced as long as the pool is not in the FAULTED state, though any bad data on the device is copied to the new device unless there are sufficient replicas with good data. In the following configuration:
The disk c0t0d1 can be replaced, and any data in the pool is copied from the good replica, c0t0d0. The disk c0t0d0 can also be replaced, though no self-healing of data can take place since there is no good replica available. In the following configuration:
Neither of the faulted disks can be replaced. The ONLINE disks cannot be replaced either, since the pool itself is faulted in this case. In the following configuration:
Either top-level disk can be replaced, though any bad data present on the disk is copied to the new disk. If either disk were faulted, then no replacement could be done, since the pool itself would be faulted.
9.6.3.2 Unreplaceable Devices
If the loss of a device causes the pool to become faulted, or the device contains too many data errors in an unreplicated configuration, then it cannot safely be replaced. Without sufficient replicas, there is no good data with which to heal the damaged device. In this case, the only option is to destroy the pool and recreate the configuration, restoring your data in the process. For more information on restoring an entire pool, see 9.7.3 Repairing Pool Wide Damage.
9.6.3.3 Replacing a Device
Once it has been determined that a device can be replaced, simply use the zpool replace command. If you are replacing the damaged device with a different device, use the following command:
This command begins migrating data to the new device from the damaged device, or from other devices in the pool if the device is part of a replicated configuration. When the command is finished, it detaches the damaged device from the configuration, at which point the device can be removed from the system. If you have already removed the device and replaced it with a new device in the same location, use the single-device form of the command:
This command takes an unformatted disk, formats it appropriately, and then begins resilvering data from the rest of the configuration. For more information on the zpool replace command, see 4.5.3 Replacing Devices.
9.6.3.4 Viewing Resilvering Status
The process of replacing a drive can take an extended period of time, depending on the size of the drive and the amount of data in the pool. The process of moving data from one device to another is known as resilvering, and it can be monitored with the zpool status command. Traditional file systems, which rely on a separate volume manager, resilver data at the block level. Because ZFS eliminates this artificial layering between file system and volume manager, it can perform resilvering in a much more powerful and controlled manner. The two main advantages are:
To view the resilvering process, use the zpool status command:
In the above example, the disk c0t0d0 is being replaced by c0t0d2. This is indicated by the replacing virtual device that appears in the configuration. This is not a real device, nor is it possible for the user to create a pool using this virtual device type. Its sole purpose is to display the resilvering process and to identify exactly which device is being replaced. Note that any pool currently undergoing resilvering is placed in the DEGRADED state, because the pool cannot provide the desired replication level until the resilvering process is complete. Resilvering always proceeds as fast as possible, though its I/O is always scheduled with lower priority than user-requested I/O, to minimize the impact on the system. Once the resilvering is complete, the pool shows the new, complete configuration:
The pool is once again ONLINE, and the original bad disk (c0t0d0) has been removed from the configuration.
9.7 Repairing Damaged Data
ZFS uses checksumming, replication, and self-healing data to minimize the chances of data corruption. Even so, data corruption can occur if the pool isn't replicated, if corruption occurred while the pool was degraded, or if an unlikely series of events conspired to corrupt multiple copies of a piece of data. Regardless of the source, the result is the same: the data is corrupted and therefore no longer accessible. The action to take depends on the type of data that is corrupted and its relative value. There are two basic types of data that can be corrupted:
Data is verified during normal operation as well as through scrubbing. For more information on how to verify the integrity of pool data, see 9.2 Checking Data Integrity.
9.7.1 Identifying Type of Data Corruption
By default, the zpool status command shows only the fact that corruption has occurred, without specifics on where this corruption was seen:
With the -v option, a complete list of errors is given: