Use of the ZFS file system. ZFS has several interesting characteristics, such as RAID, self-healing, snapshots, clones, rollback, caching, deduplication, and copy-on-write.
- Reference: Open ZFS
- Follow-up: ZFS debugging
Pool
Creation
The disk pool will be built according to the capacity and redundancy required, and of course the number of available disks. The different groups of disks that can be added to the pool are detailed below:
Group | Redundancy | RAID alternative |
---|---|---|
`disk` | 0 | RAID0 / JBOD |
`mirror` | 1 → n | RAID1 / Mirror |
`raidz1` | 1 | RAID5 |
`raidz2` | 2 | RAID6 |
`raidz3` | 3 | |
Having redundancy of type raidz reduces performance because of the parity calculations required to distribute data across the disks. But it protects a group against the loss of 1 disk (`raidz1`), 2 disks (`raidz2`) or 3 disks (`raidz3`) without the need to dedicate 50% of the initial storage capacity to redundancy, as is the case with a mirror (`mirror`).
The advantage of having at least 2 disks of redundancy (`raidz2` or `raidz3` configurations) is to guard against cascading failures:
- if the disks are from the same batch, they may have the same defect, and thus fail at nearly the same time;
- the time required to rebuild the group renders it vulnerable to the loss of additional disks, especially since this period tends to grow with current disk capacities (and can stretch to a few days), and the intensive reads stress the remaining disks.
Examples of pool creation (the pool is called `tank`) consisting of a single group:
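A sketch of the corresponding commands; the device names (`da0` … `da3`) are placeholders to adapt to the system:

```
zpool create tank da0                       # disk: stripe, no redundancy
zpool create tank mirror da0 da1            # mirror
zpool create tank raidz1 da0 da1 da2        # raidz1: single parity
zpool create tank raidz2 da0 da1 da2 da3    # raidz2: double parity
```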
The disk space available on a replacement device must be at least equal to that of the device it replaces. It may therefore be desirable not to dedicate the entire disk to ZFS, but to protect ourselves by creating a partition slightly smaller than the disk. The announced capacity may indeed differ somewhat depending on the manufacturer and reference, and this can even be the case with disks that should be identical: same manufacturer, same reference, same purchase date.
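One way to create such a partition, assuming FreeBSD and its `gpart` tool (sizes and labels are illustrative, for a disk of about 1 TB):

```
gpart create -s gpt da0                        # GPT partition scheme on the disk
gpart add -t freebsd-zfs -s 930g -l disk0 da0  # partition slightly smaller than the disk
zpool create tank mirror gpt/disk0 gpt/disk1   # use the labels (da1 prepared the same way)
```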
As it is hardly possible to be behind each array in case of problems, there is an option to mark some disks as spares: a spare will immediately replace a failed disk and start the repair process.
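For example, with a hypothetical device `da4`:

```
zpool add tank spare da4
```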
To improve performance, the ZIL (ZFS Intent Log) can be placed on a dedicated disk with better throughput and access time (typically an SSD device). The ZIL is responsible for satisfying the POSIX requirements for synchronous writes.
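A sketch, assuming an SSD device `ada0` (the log device can itself be mirrored):

```
zpool add tank log ada0               # dedicated log device
zpool add tank log mirror ada0 ada1   # or: mirrored log devices
```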
Also for performance reasons, it is possible to add a cache consisting of one or more disks; here too, throughput and low access time are paramount.
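For example, with a hypothetical SSD device `ada2`:

```
zpool add tank cache ada2
```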
Verification
Conducts an audit of the pool (thanks to the checksums present on the blocks) and repairs it, if possible, from the available redundancy (`mirror`, `raidz`, and multiple copies). This is also an indirect way to restart an aborted resilvering phase.
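For example, for the pool `tank`:

```
zpool scrub tank
```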
Displays the status of the pools and, in case of problems, lists the files impacted by the errors.
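For example:

```
zpool status -v tank    # -v also lists the files impacted by errors
```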
Performance
Provides information on the performance of the pool.
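For example, refreshing every 5 seconds (the interval is illustrative):

```
zpool iostat -v tank 5
```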
Disk replacement
Replacing a disk is done by specifying the device to replace, `/dev/old_device`, and the replacement device, `/dev/new_device`:
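For example:

```
zpool replace tank /dev/old_device /dev/new_device
```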
In the case where the disk to replace is part of a pool with redundancy, it is possible to directly replace the disk with another one in the same slot. The command simplifies to:
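That is:

```
zpool replace tank /dev/old_device
```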
File systems
Logical volume
It is possible to use the ZFS pool not to create a file system, but a logical volume (i.e. a raw block device) that can be used later to create other ZFS pools, iSCSI disks, …
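A sketch, with an illustrative name and size; the volume then appears as `/dev/zvol/tank/vol0`:

```
zfs create -V 10G tank/vol0
```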
Compression
Several compression algorithms are available, with different compression speeds and ratios (see the table below). Even though new algorithms have been introduced over time, the old ones are kept for compatibility (allowing previously written compressed data to be read).
Name | Description | Obsoletes | Version / feature flag |
---|---|---|---|
`lzjb` | Initial ZFS high-speed compression | | |
`gzip` | High compression ratio (high CPU) | | 5 |
`zle` | Compresses runs of zeros | | 20 |
`lz4` | Extremely fast compression | `lzjb` | `lz4_compress` |
`zstd` | Real-time compression with zlib-level ratios | `gzip` | `zstd_compress` |
The choice of a compression algorithm, according to the desired compression ratio and/or performance, is usually between:
- `lz4`: fast compression (the default algorithm)
- `zstd`: high compression ratio
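For example, enabling `lz4` on a file system (the file system name is illustrative; compression applies to newly written data):

```
zfs set compression=lz4 tank/web
zfs get compressratio tank/web    # observed compression ratio
```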
Deduplication
Deduplication allows saving disk space by keeping only one or a few copies (see `dedupditto`) of identical blocks.
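For example, on a hypothetical file system:

```
zfs set dedup=on tank/web
```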
When there is no more space available on the file system, clones, snapshots, and especially deduplication make it difficult to release disk space. Deleting a file no longer automatically means making the disk blocks associated with that file available, as they may also be used by others.
Backup and restoration
The `zfs send` and `zfs recv` commands respectively perform a backup and a restoration of a file system or of a set of file systems. They can be seen as the equivalent of the dump and restore commands for traditional file systems like UFS.
- Backup: send a data stream

```
zfs send -R tank/web@today                  # Full
zfs send -R -I @last-month tank/web@today   # Incremental
```

Option | Description |
---|---|
`-R` | descendants are included |
`-I @tag` | incrementally sends all the intermediate snapshots since *tag* |

The ability to send a deduplicated stream (option `-D`) has been deprecated¹.

- Restore: receive a data stream

```
zfs recv -u -F -d tank/backup
```

Option | Description |
---|---|
`-u` | the file system is not mounted |
`-d` | the snapshot names used are the ones included in the stream |
`-F` | a rollback of the file system is performed if changes were made to the destination |
Commands can be put together using `ssh` to perform a backup on a remote server:
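A sketch, assuming a remote host named `backup-host`:

```
zfs send -R tank/web@today | ssh backup-host zfs recv -u -d tank/backup                  # full
zfs send -R -I @last-month tank/web@today | ssh backup-host zfs recv -u -d tank/backup   # incremental
```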
Clone
For example, cloning a VM with the `squeeze` version of Debian to create a new project, `projectX`:
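A sketch, assuming an existing snapshot `tank/vm/squeeze@template` (names are illustrative):

```
zfs clone tank/vm/squeeze@template tank/vm/projectX
```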
Snapshot and rollback
The snapshot mechanism makes it possible, for example, to implement archiving; coupled with the rollback mechanism, it also guards against an unsuccessful update.
Snapshot
Creates, on the mentioned file system, a snapshot with the tag `just-in-case`, or a recursive snapshot including all its descendants, tagged with the current date (`date +%Y-%m-%d`):
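A sketch, with an illustrative file system name:

```
zfs snapshot tank/web@just-in-case
zfs snapshot -r tank@$(date +%Y-%m-%d)
```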
Option | Description |
---|---|
-r | atomically creates a snapshot on the file system and its descendants |
Rollback
Performs a rollback to a previous snapshot:
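For example, returning to the snapshot created above:

```
zfs rollback tank/web@just-in-case
```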
Option | Description |
---|---|
-r | also destroys the snapshots newer than the specified one |
-R | also destroys the snapshots newer than the one specified and their clones |
-f | forces an unmount of any clone file systems that are to be destroyed |
There is currently no option to apply a rollback recursively to the descendants (i.e. no equivalent of `zfs snapshot -r`); in that case it is necessary to manually perform a rollback on each file system.
¹ https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSStreamDedupGone : "Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it."