Chapter 1

Introduction

This chapter provides an overview of ZFS and its features and benefits. It also covers some basic terminology used throughout the rest of this book. The following sections are provided in this chapter.

1.1 What is ZFS?

The Zettabyte File System (ZFS) is a revolutionary new filesystem that fundamentally changes the way filesystems are administered, with features and benefits not found in any other filesystem available today. ZFS has been designed from the ground up to be robust, scalable, and simple to administer.

1.1.1 Pooled Storage

ZFS uses the concept of storage pools to manage physical storage. Historically, filesystems were constructed on top of a single physical device. To address multiple devices and provide for data redundancy, the concept of a volume manager was introduced to present the image of a single device, so that filesystems would not have to be modified to take advantage of multiple devices. This added another layer of complexity, and ultimately prevented certain filesystem advances, since the filesystem had no control over the physical placement of data on the virtualized volumes.

ZFS does away with the volume manager altogether. Instead of forcing the administrator to create virtualized volumes, ZFS aggregates devices into a storage pool. The storage pool describes the physical characteristics of the storage (device layout, data redundancy, and so on) and acts as an arbitrary data store from which filesystems can be created. Filesystems are no longer constrained to individual devices, allowing them to share space with all filesystems in the pool. There is no need to predetermine the size of a filesystem, as filesystems grow automatically within the space allocated to the storage pool. When new storage is added, all filesystems within the pool can immediately make use of the additional space without additional work. In many ways, the storage pool acts like a virtual memory system: when a memory DIMM is added to a system, the operating system doesn't force the administrator to invoke commands to configure the memory and assign it to individual processes -- all processes on the system automatically make use of the additional memory.
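As a brief, hypothetical sketch of the pooled storage model (the pool name tank and the device names c1t0d0 and c1t1d0 are placeholders chosen for illustration, not values taken from this book), the following commands create a mirrored storage pool and then create two filesystems within it:

    # zpool create tank mirror c1t0d0 c1t1d0
    # zfs create tank/home
    # zfs create tank/home/user1

Neither filesystem is given a fixed size; both simply draw on, and are limited only by, the space available in the pool.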
1.1.2 Transactional Semantics

ZFS is a transactional filesystem, which means that the filesystem state is always consistent on disk. Traditional filesystems overwrite data in place, which means that if the machine loses power between, say, the time a data block is allocated and when it is linked into a directory, the filesystem will be left in an inconsistent state. Historically, this problem was solved through the use of the fsck(1M) command, which was responsible for going through and verifying the filesystem state, attempting to repair it in the process. This caused great pain to administrators, and was never guaranteed to fix all possible problems. More recently, filesystems have introduced the concept of journaling, which records actions in a separate journal that can then be replayed safely in the event of a crash. This introduces unnecessary overhead (the data needs to be written twice) and often results in a new set of problems, such as when the journal cannot be replayed properly.

With a transactional filesystem, data is managed using copy-on-write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. This means that the filesystem can never be corrupted through accidental loss of power or a system crash, and there is no need for an fsck(1M) equivalent. While the most recently written pieces of data might be lost, the filesystem itself will always be consistent. In addition, synchronous data (written using the O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.

1.1.3 Checksums and Self-Healing Data

With ZFS, all data and metadata is checksummed using a user-selectable algorithm. Traditional filesystems that do provide checksumming have performed it on a per-block basis, out of necessity due to the volume manager layer and traditional filesystem design. This means that certain failure modes, such as writing a complete block to an incorrect location, can result in properly checksummed data that is actually incorrect. ZFS checksums are stored in a way such that these failure modes are detected and can be recovered from gracefully. All checksumming and data recovery is done at the filesystem layer and is transparent to the application.

In addition, ZFS provides for self-healing data. ZFS supports storage pools with varying levels of data redundancy, including mirroring and a variation on RAID-5. When a bad data block is detected, ZFS not only fetches the correct data from another replicated copy, but also repairs the bad data, replacing it with the good copy.

1.1.4 Unparalleled Scalability

ZFS has been designed from the ground up to be the most scalable filesystem ever. The filesystem itself is a 128-bit filesystem, allowing for 256 quadrillion zettabytes of storage. All metadata is allocated dynamically, so there is no need to pre-allocate inodes or otherwise limit the scalability of the filesystem when it is first created. All the algorithms have been written with scalability in mind. Directories can have up to 2^48 (256 trillion) entries, and there is no limit on the number of filesystems or the number of files within a filesystem.

1.1.5 Snapshots and Clones

A snapshot is a read-only copy of a filesystem or volume. Snapshots can be created quickly and easily. Initially, snapshots consume no additional space within the pool. As data within the active dataset changes, the snapshot consumes space by continuing to reference the old data, and so prevents it from being freed back to the pool. A clone is a writable filesystem whose initial contents are identical to those of the snapshot from which it was created.

1.1.6 Simplified Administration

Most importantly, ZFS provides a greatly simplified administration model. Through the use of a hierarchical filesystem layout, property inheritance, and automatic management of mount points and NFS share semantics, ZFS makes it easy to create and manage filesystems without needing multiple different commands or editing configuration files. The administrator can easily set quotas or reservations, turn compression on or off, or manage mount points for large numbers of filesystems with a single command. Devices can be examined or repaired without having to understand a separate set of volume manager commands. Administrators can take an unlimited number of instantaneous snapshots of filesystems, and can back up and restore individual filesystems.

ZFS manages filesystems through a hierarchy that allows for this simplified management of properties such as quotas, reservations, compression, and mount points. In this model, filesystems become the central point of control. Filesystems themselves are very cheap (equivalent to creating a new directory), so administrators are encouraged to create a filesystem for each user, project, workspace, and so on. This allows the administrator to define arbitrarily fine-grained management points.
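As a hypothetical sketch of this administration model (reusing the placeholder tank/home hierarchy from the earlier example), the following commands set a quota on a single filesystem, enable compression for an entire subtree (the compression property is inherited by descendant filesystems), take an instantaneous snapshot, and start a scrub, which verifies all data in the pool against its checksums and repairs bad blocks from redundant copies where possible:

    # zfs set quota=10G tank/home/user1
    # zfs set compression=on tank/home
    # zfs snapshot tank/home/user1@monday
    # zpool scrub tank
    # zpool status tank

The zpool status command reports the progress of the scrub and any errors that were found.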
1.2 ZFS Terminology

The following table covers the basic terminology used throughout this book.
Each dataset is identified by a unique name in the ZFS namespace. Datasets are named using the following format: pool/path[@snapshot]
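For example, using hypothetical names:

    tank/home/user1           an active filesystem in the pool tank, at path home/user1
    tank/home/user1@friday    a snapshot of that filesystem, named friday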
1.3 ZFS Component Naming Conventions

Each ZFS component must be named according to the following rules: