Use of the ZFS file system. ZFS has several interesting characteristics, such as RAID, self-healing, snapshots, clones, rollback, caching, deduplication, and copy-on-write.
- Reference: Open ZFS
- Follow-up: ZFS debugging
Pool
Creation
The disk pool will be built according to the capacity and redundancy required, and of course the number of available disks. The different groups of disks that can be added to the pool are detailed below:
Group | Redundancy | RAID alternative |
---|---|---|
`disk` | 0 | RAID0 / JBOD |
`mirror` | 1 → n | RAID1 / Mirror |
`raidz1` | 1 | RAID5 |
`raidz2` | 2 | RAID6 |
`raidz3` | 3 | |
Having redundancy of type raidz reduces performance because of the parity calculations required to distribute data across the disks. But it protects a group against the loss of 1 disk (`raidz1`), 2 disks (`raidz2`) or 3 disks (`raidz3`) without the need to dedicate 50% of the initial storage capacity to redundancy, as is the case with a mirror (`mirror`).
The advantage of having at least 2 disks of redundancy (`raidz2` or `raidz3` configurations) is to guard against cascading failures:
- if the disks are from the same batch, they may have the same defect, and thus fail at nearly the same time;
- the time required to rebuild the group renders it vulnerable to the loss of additional disks, especially since this period tends to grow with current disk capacities (and can stretch to a few days), and the intensive reads stress the remaining disks.
Examples of pool creation (the pool is called `tank`) consisting of a single group:
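A sketch of the corresponding commands; the device names (`da0` … `da3`) are placeholders to adapt to the system:

```
zpool create tank da0                       # disk: stripe, no redundancy
zpool create tank mirror da0 da1            # mirror
zpool create tank raidz1 da0 da1 da2        # raidz1: single parity
zpool create tank raidz2 da0 da1 da2 da3    # raidz2: double parity
```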
The disk space available on a replacement device must be at least equal to that of the device it replaces. It may therefore be desirable not to dedicate the entire disk to ZFS, but to protect ourselves by creating a partition slightly smaller than the disk. The announced capacity may indeed differ somewhat depending on the manufacturer and reference, and this can even be the case with disks that should be identical: same manufacturer, same reference, same purchase date.
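One way to create such a partition, assuming FreeBSD and its `gpart` tool (sizes and labels are illustrative, for a disk of about 1 TB):

```
gpart create -s gpt da0                        # GPT partition scheme on the disk
gpart add -t freebsd-zfs -s 930g -l disk0 da0  # partition slightly smaller than the disk
zpool create tank mirror gpt/disk0 gpt/disk1   # use the labels (da1 prepared the same way)
```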
As it is hardly possible to be behind each array in case of problems, there is an option to mark some disks as spares: a spare will immediately replace a failed disk and start the repair process.
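For example, with a hypothetical device `da4`:

```
zpool add tank spare da4
```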
To improve performance, the ZIL (ZFS Intent Log) can be placed on a dedicated disk with better throughput and access time (typically an SSD device). The ZIL is responsible for satisfying the POSIX requirements for synchronous writes.
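A sketch, assuming an SSD device `ada0` (the log device can itself be mirrored):

```
zpool add tank log ada0               # dedicated log device
zpool add tank log mirror ada0 ada1   # or: mirrored log devices
```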
Also for performance reasons, it is possible to add a cache consisting of one or more disks; here too, throughput and low access time are paramount.
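For example, with a hypothetical SSD device `ada2`:

```
zpool add tank cache ada2
```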
Verification
Conducts an audit of the pool (thanks to the checksums present on the blocks) and repairs it, if possible, from the available redundancy (`mirror`, `raidz`, and multiple copies). This is also an indirect way to restart an aborted resilvering phase.
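For example, for the pool `tank`:

```
zpool scrub tank
```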
Displays the status of the pools and, in case of problems, lists the files impacted by the errors.
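For example:

```
zpool status -v tank    # -v also lists the files impacted by errors
```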
Performance
Provides information on the performance of the pool.
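For example, refreshing every 5 seconds (the interval is illustrative):

```
zpool iostat -v tank 5
```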
Disk replacement
Replacing a disk is done by specifying the device to replace, `/dev/old_device`, and the replacement device, `/dev/new_device`:
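For example:

```
zpool replace tank /dev/old_device /dev/new_device
```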
In the case where the disk to replace is part of a pool with redundancy, it is possible to directly replace the disk with another one in the same slot. The command simplifies to:
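That is:

```
zpool replace tank /dev/old_device
```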
File systems
Logical volume
It is possible to use the ZFS pool not to create a file system, but a logical volume (i.e. a raw block device) that can be used later to create other ZFS pools, iSCSI disks, …
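A sketch, with an illustrative name and size; the volume then appears as `/dev/zvol/tank/vol0`:

```
zfs create -V 10G tank/vol0
```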
Compression
Several compression algorithms are available, with different compression speeds and ratios (see the table below). Even though new algorithms have been introduced over time, the old ones are kept for compatibility (allowing previously written compressed data to be read).
Name | Description | Obsoletes | Version / feature flag |
---|---|---|---|
`lzjb` | Initial ZFS high-speed compression | | |
`gzip` | High compression ratio (high CPU) | | 5 |
`zle` | Compresses runs of zeros | | 20 |
`lz4` | Extremely fast compression | `lzjb` | `lz4_compress` |
`zstd` | Real-time compression with zlib-level ratios | `gzip` | `zstd_compress` |
The choice of a compression algorithm, according to the desired compression ratio and/or performance, is usually between:
- `lz4`: fast compression (the default algorithm)
- `zstd`: high compression ratio
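For example, enabling `lz4` on a file system (the file system name is illustrative; compression applies to newly written data):

```
zfs set compression=lz4 tank/web
zfs get compressratio tank/web    # observed compression ratio
```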
Deduplication
Deduplication allows saving disk space by keeping only one or a few copies (see `dedupditto`) of identical blocks.
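For example, on a hypothetical file system:

```
zfs set dedup=on tank/web
```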
When there is no more space available on the file system, clones, snapshots, and especially deduplication make it difficult to release disk space. Deleting a file no longer automatically means making the disk blocks associated with that file available, as they may also be used by others.
Backup and restoration
The `zfs send` and `zfs recv` commands respectively perform a backup and a restoration of a file system or of a set of file systems. They can be seen as the equivalent of the dump and restore commands for traditional file systems like UFS.
- Backup: send a data stream

```
zfs send -R tank/web@today                  # Full
zfs send -R -I @last-month tank/web@today   # Incremental
```

Option | Description |
---|---|
`-R` | descendants are included |
`-I @tag` | incrementally sends all the intermediate snapshots since *tag* |

The ability to send a deduplicated stream (option `-D`) has been deprecated¹.

- Restore: receive a data stream

```
zfs recv -u -F -d tank/backup
```

Option | Description |
---|---|
`-u` | the file system is not mounted |
`-d` | the snapshot names used are the ones included in the stream |
`-F` | a rollback of the file system is performed if changes were made to the destination |
Commands can be put together using `ssh` to perform a backup on a remote server:
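A sketch, assuming a remote host named `backup-host`:

```
zfs send -R tank/web@today | ssh backup-host zfs recv -u -d tank/backup                  # full
zfs send -R -I @last-month tank/web@today | ssh backup-host zfs recv -u -d tank/backup   # incremental
```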
Clone
For example, cloning a VM with the `squeeze` version of Debian to create a new project, `projectX`:
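A sketch, assuming an existing snapshot `tank/vm/squeeze@template` (names are illustrative):

```
zfs clone tank/vm/squeeze@template tank/vm/projectX
```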
Snapshot and rollback
The snapshot mechanism makes it possible, for example, to implement archiving; coupled with the rollback mechanism, it also guards against an unsuccessful update.
Snapshot
Creates, on the mentioned file system, a snapshot with the tag `just-in-case`, or a recursive snapshot including all its descendants, tagged with the current date (`date +%Y-%m-%d`):
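A sketch, with an illustrative file system name:

```
zfs snapshot tank/web@just-in-case
zfs snapshot -r tank@$(date +%Y-%m-%d)
```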
Option | Description |
---|---|
-r | atomically creates a snapshot on the file system and its descendants |
Rollback
Performs a rollback to a previous snapshot:
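For example, returning to the snapshot created above:

```
zfs rollback tank/web@just-in-case
```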
Option | Description |
---|---|
-r | also destroys the snapshots newer than the specified one |
-R | also destroys the snapshots newer than the one specified and their clones |
-f | forces an unmount of any clone file systems that are to be destroyed |
There is currently no option to apply a rollback recursively to the descendants (i.e. no equivalent of `zfs snapshot -r`); in that case it is necessary to manually perform a rollback on each file system.
¹ https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSStreamDedupGone : "Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it."