Product: Storage Foundation Cluster File System Guides   
Manual: Cluster File System 4.1 Installation and Administration Guide   

About CFS

SFCFS uses a master/slave, or primary/secondary, architecture to manage file system metadata on shared disk storage. The first node to mount a file system becomes its primary; the remaining nodes are secondaries. Secondaries send requests to the primary to perform metadata updates. The primary node updates the metadata and maintains the file system's metadata update intent log. Data can also be written from any node directly to shared storage, and other file system operations, such as allocating or deleting files, can originate from any node in the cluster.

If the server on which the SFCFS primary is running fails, the remaining cluster nodes elect a new primary. See Distributing Load on a Cluster for details on how primaries are designated. The new primary reads the file system intent log and completes any metadata updates that were in process at the time of the failure. Application I/O from other nodes may block during this process and cause a delay. When the file system is again consistent, application processing resumes.

Because nodes using a cluster file system in secondary mode do not update file system metadata directly, failure of a secondary node does not require metadata repair. SFCFS recovery from secondary node failure is therefore faster than from primary node failure.

SFCFS and the Group Lock Manager

SFCFS uses the VERITAS Group Lock Manager (GLM) to reproduce UNIX single-host file system semantics in clusters. UNIX file systems make writes appear atomic. This means when an application writes a stream of data to a file, a subsequent application reading from the same area of the file retrieves the new data, even if it has been cached by the file system and not yet written to disk. Applications cannot retrieve stale data or partial results from a previous write.

To reproduce single-host write semantics, system caches must be kept coherent, and each must instantly reflect updates to cached data, regardless of the node from which they originate.

Asymmetric Mounts

A VxFS file system mounted with the mount -o cluster option is a cluster, or shared, mount, as opposed to a non-shared or local mount. A file system mounted in shared mode must be on a VxVM shared volume in a cluster environment. A local mount cannot be remounted in shared mode, and a shared mount cannot be remounted in local mode. File systems in a cluster can be mounted with different read/write options. These are called asymmetric mounts.

Asymmetric mounts allow shared file systems to be mounted with different read/write capabilities. One node in the cluster can mount read/write, while other nodes mount read-only.

You can specify the cluster read-write (crw) option when you first mount the file system, or the options can be altered when doing a remount (mount -o remount). The first column in the following table shows the mode in which the primary is mounted. The check marks indicate the modes secondary mounts can use. See the mount_vxfs(1M) man page for details on the cluster read-write (crw) mount option.

                          Secondary
                   ro        rw        ro, crw
  Primary  ro      ✓
           rw                ✓         ✓
           ro, crw           ✓         ✓

Mounting the primary with only the -o cluster,ro option prevents the secondaries from mounting in a different mode; that is, read/write. Note that rw implies read/write capability throughout the cluster.
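
For example, one combination allowed by the table above can be set up as follows. The disk group, volume, and mount point names are placeholders, and Solaris-style mount syntax is assumed; see the mount_vxfs(1M) man page for the exact form on your platform. On the node that is to be the primary, mount read/write:

mount -F vxfs -o cluster /dev/vx/dsk/sharedg/vol1 /mnt1

On a secondary node, mount read-only with the crw option so that other nodes retain read/write capability:

mount -F vxfs -o cluster,ro,crw /dev/vx/dsk/sharedg/vol1 /mnt1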

Parallel I/O

Some distributed applications read and write to the same file concurrently from one or more nodes in the cluster; for example, any distributed application where one thread appends to a file and one or more threads read from various regions in the file. Several high-performance computing (HPC) applications can also benefit from this feature when concurrent I/O is performed on the same file. Applications do not require any changes to use the parallel I/O feature.

Traditionally, the entire file is locked to perform I/O to a small region. To support parallel I/O, CFS locks ranges in a file that correspond to an I/O request. The granularity of the locked range is a page. Two I/O requests conflict if at least one is a write request, and the I/O range of the request overlaps the I/O range of the other.

The parallel I/O feature enables I/O to a file by multiple threads concurrently, as long as the requests do not conflict. Threads issuing concurrent I/O requests could be executing on the same node, or on a different node in the cluster.

An I/O request that requires allocation is not executed concurrently with other I/O requests. Note that when a writer is extending the file and readers are lagging behind, block allocation is not necessarily done for each extending write.

If the file size can be predetermined, the file can be preallocated to avoid block allocations during I/O. This improves the concurrency of applications performing parallel I/O to the file. Parallel I/O also avoids unnecessary page cache flushes and invalidations using range locking, without compromising the cache coherency across the cluster.
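
For example, assuming the VxFS setext(1M) command is available and using a placeholder file name, space can be reserved for a file before the parallel workload starts; check the setext(1M) man page for the units and flags supported on your platform:

setext -r 204800 -f chgsize /mnt1/datafile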

For applications that update the same file from multiple nodes, the nomtime mount option provides further concurrency. Modification and change times of the file are not synchronized across the cluster, which eliminates the overhead of increased I/O and locking.
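
For example, assuming Solaris-style mount syntax and placeholder names:

mount -F vxfs -o cluster,nomtime /dev/vx/dsk/sharedg/vol1 /mnt1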

CFS Namespace

The mount point name must remain the same for all nodes mounting the same cluster file system. This is required for the VCS mount agents (online, offline, and monitoring) to work correctly.

SFCFS Backup Strategies

The same backup strategies used for standard VxFS can be used with SFCFS because the APIs and commands for accessing the namespace are the same. File system checkpoints provide an on-disk, point-in-time copy of the file system. Because the performance characteristics of a checkpointed file system are better for certain I/O patterns, checkpoints are recommended over file system snapshots (described below) for obtaining a frozen image of the cluster file system.
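
For example, assuming the fsckptadm(1M) command and placeholder names, a checkpoint can be created and then mounted read-only for backup; see the fsckptadm(1M) and mount_vxfs(1M) man pages for the exact syntax on your platform:

fsckptadm create thu_8pm /mnt1
mount -F vxfs -o ro,ckpt=thu_8pm /dev/vx/dsk/sharedg/vol1:thu_8pm /mnt1_thu_8pm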

File system snapshots are another method of obtaining an on-disk frozen image of a file system. Unlike a checkpoint, the frozen image is non-persistent. A snapshot can be accessed as a read-only mounted file system to perform efficient online backups of the file system. Snapshots implement "copy-on-write" semantics that incrementally copy data blocks when they are overwritten on the snapped file system. Snapshots for cluster file systems extend the same copy-on-write mechanism to I/O originating from any node in the cluster.

Mounting a snapshot file system for backups increases the load on the system because of the resources used to perform copy-on-writes and to read data blocks from the snapshot. In this situation, cluster snapshots can be used to do off-host backups. Off-host backups reduce the load that the backup application places on the primary server. Overhead from remote snapshots is small compared to overall snapshot overhead. Therefore, running a backup application by mounting a snapshot from a relatively less loaded node benefits overall cluster performance.
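
For example, assuming Solaris-style mount syntax and placeholder volume and mount point names, a snapshot of the cluster file system mounted at /mnt1 can be created and mounted on a less loaded node, then backed up from that node with your usual backup utility:

mount -F vxfs -o snapof=/mnt1 /dev/vx/dsk/sharedg/snapvol /snapmnt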

There are several characteristics of a cluster snapshot, including:

  • A snapshot for a cluster mounted file system can be mounted on any node in a cluster. The node can be a primary, a secondary, or a secondary-only node. A stable image of the file system is provided for writes from any node.
  • Multiple snapshots of a cluster file system can be mounted on the same node or on different cluster nodes.
  • A snapshot is accessible only on the node mounting a snapshot. The snapshot device cannot be mounted on two nodes simultaneously.
  • The device for mounting a snapshot can be a local disk or a shared volume. A shared volume is used exclusively by a snapshot mount and is not usable from other nodes as long as the snapshot is active on that device.
  • On the node mounting a snapshot, the snapped file system cannot be unmounted while the snapshot is mounted.
  • An SFCFS snapshot ceases to exist if it is unmounted or if the node mounting the snapshot fails. A snapshot is not affected, however, if any other node leaves or joins the cluster.
  • A snapshot of a read-only mounted file system cannot be taken. It is possible to mount a snapshot of a cluster file system only if the snapped cluster file system is mounted with the crw option.

In addition to file-level frozen images, there are volume-level alternatives available for shared volumes using mirror split and rejoin. Features such as Fast Mirror Resync and Space Optimized snapshot are also available. See the VERITAS Volume Manager System Administrator's Guide for details.

Synchronizing Time on Cluster File Systems

SFCFS requires that the system clocks on all nodes be synchronized using some external component such as the Network Time Protocol (NTP) daemon. If the nodes are not in sync, timestamps for change (ctime) and modification (mtime) may not be consistent with the sequence in which operations actually happened.
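
For example, assuming the standard NTP utilities are installed, a quick way to spot-check synchronization is to run the following on each node and compare the reported offsets and times; this is a sanity check only, not a substitute for configuring the NTP daemon properly:

ntpq -p
date -u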

Distributing Load on a Cluster

Because the primary node on the cluster performs all metadata updates for a given file system, it makes sense to distribute this load by designating different nodes to serve as primary for each of the cluster file systems in use. Distributing workload in a cluster provides performance and failover advantages. Because each cluster mounted file system can have a different node as its primary, SFCFS enables easy load distribution.

For example, if you have eight file systems and four nodes, designating two file systems per node as the primary is beneficial. The first node that mounts a file system becomes the primary for that file system.

You can also use the fsclustadm command to designate an SFCFS primary. The fsclustadm setprimary mount_point command changes the primary. This change is not persistent across unmounts or reboots; it remains in effect as long as one or more nodes in the cluster have the file system mounted. The primary selection policy can also be defined by a VCS attribute associated with the SFCFS mount resource.
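
For example, using a placeholder mount point, you can display the current primary and then make the local node the primary; see the fsclustadm(1M) man page for the exact syntax on your platform:

fsclustadm -v showprimary /mnt1
fsclustadm setprimary /mnt1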

A file system can be mounted on a node with the mount -o seconly option, in which case that node never assumes the primary role for the file system. If the primary node fails and the remaining nodes all have the file system mounted as secondary-only, the file system is disabled. The seconly option overrides the fsclustadm command. See the mount_vxfs(1M) man page for details on the seconly option.
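
For example, assuming Solaris-style mount syntax and placeholder names, a secondary-only mount looks like this:

mount -F vxfs -o cluster,seconly /dev/vx/dsk/sharedg/vol1 /mnt1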

Note that recovery from a primary failure takes more time than recovery from a secondary failure. Distributing primaries across the cluster provides a more uniform recovery time in the case of a node failure.

File System Tuneables

Tuneable parameters are updated at mount time from the tunefstab file or with the vxtunefs command. The file system tuneable parameters are kept identical on all nodes by propagating the parameters to each cluster node. When the file system is mounted on a node, the tuneable parameters of the primary node are used; the node's own tunefstab file is used only if that node is the first to mount the file system. VERITAS recommends that this file be identical on each node.
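
For example, a tunefstab entry and a query of the tuneables in effect might look like the following; the device, mount point, and values are placeholders, and the supported parameters are described in the vxtunefs(1M) and tunefstab(4) man pages. A line in /etc/vx/tunefstab:

/dev/vx/dsk/sharedg/vol1 read_pref_io=65536,read_nstream=4

To display the tuneables in effect for a mounted file system:

vxtunefs /mnt1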

Split-Brain and Jeopardy Handling

A split-brain occurs when the cluster membership view differs among the cluster nodes, increasing the chance of data corruption. Membership change also occurs when all private-link cluster interconnects fail simultaneously, or when a node is unable to respond to heartbeat messages. With I/O fencing, the potential for data corruption is eliminated. I/O fencing requires disks that support SCSI-3 PGR.

Jeopardy State

In the absence of I/O fencing, SFCFS installation requires two heartbeat links. When a node is down to a single heartbeat connection, SFCFS can no longer discriminate between loss of a system and loss of the final network connection. This state is defined as jeopardy.
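
For example, assuming the standard LLT and GAB utilities, you can check how many links each node currently has and whether GAB is reporting a jeopardy membership:

lltstat -nvv | more
gabconfig -a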

SFCFS employs jeopardy to prevent data corruption following a split-brain. Note that in certain scenarios, the possibility of data corruption remains. For example:

  • All links go down simultaneously.
  • A node hangs and is unable to respond to heartbeat messages.

To eliminate the chance of data corruption in these scenarios, I/O fencing is required. With I/O fencing, the jeopardy state does not require special handling by the SFCFS stack.

Jeopardy Handling

For installations that do not support SCSI-3 PGR, potential split-brain conditions are safeguarded by jeopardy handling. If any cluster node fails following a jeopardy state notification, the cluster file systems mounted on the failed node are disabled. If a node fails after the jeopardy state notification, all cluster nodes also leave the shared disk group membership.

Recovering from Jeopardy

The disabled file system can be restored by a force unmount and the resource can be brought online without rebooting, which also brings the shared disk group resource online. Note that if the jeopardy condition is not fixed, the nodes are susceptible to leaving the cluster again on subsequent node failure. For a detailed explanation of this topic, see the VERITAS Cluster Server User's Guide.

Fencing

With I/O fencing enabled, all remaining cases with the potential to corrupt data (for which jeopardy handling cannot protect) are addressed. For more information, see Fencing Administration.

Single Network Link and Reliability

Certain environments may prefer using a single private link or a public network for connecting nodes in a cluster, despite the loss of redundancy for dealing with network failures. The benefits of this approach include simpler hardware topology and lower costs; however, there is obviously a tradeoff with high availability.

For these environments, SFCFS provides the option of a single private link, or of using the public network as the private link, if I/O fencing is present. Note that these nodes start in the jeopardy state, as described in I/O Fencing. I/O fencing is used to handle split-brain scenarios. The option to use a single network link is given during installation.

Low Priority Link

LLT can be configured to use a low-priority network link as a backup to normal heartbeat channels. Low-priority links are typically configured on the customer's public or administrative network. This typically results in a completely different network infrastructure than the cluster private interconnect, and reduces the chance of a single point of failure bringing down all links. The low-priority link is not used for cluster status traffic until it is the only remaining link. In normal operation, the low-priority link carries only heartbeat traffic for cluster membership and link state maintenance. The frequency of heartbeats on the low-priority link is reduced by 50 percent to lower network overhead. When the low-priority link is the only remaining network link, LLT switches all cluster status traffic over to it. Following repair of any configured private link, LLT returns cluster status traffic to the high-priority link.

LLT links can be added or removed while clients are connected. Shutting down GAB or the high-availability daemon, HAD, is not required.

  To add a link


lltconfig -d device -t tag

  To remove a link


lltconfig -u tag

Changes take effect immediately and are lost on the next reboot. For changes to span reboots, you must also update the /etc/llttab file.
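
For example, a Solaris-style /etc/llttab with two private links and one low-priority link on the public network might look like the following; the node name, cluster ID, and device names are placeholders:

set-node node01
set-cluster 101
link qfe0 /dev/qfe:0 - ether - -
link qfe1 /dev/qfe:1 - ether - -
link-lowpri hme0 /dev/hme:0 - ether - -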


Note   LLT clients do not recognize the difference unless only one link is available and GAB declares jeopardy.

I/O Error Handling Policy

I/O errors can occur for several reasons, including failures of Fibre Channel links, host bus adapters, and disks. SFCFS disables the file system on the node encountering I/O errors. The file system remains available from other nodes.

After the hardware error is fixed (for example, the Fibre Channel link is reestablished), the file system can be force unmounted and the mount resource can be brought online from the disabled node to reinstate the file system.
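
For example, assuming the file system is under VCS control with a placeholder mount resource name, recovery from the disabled node might look like the following; umount -f is the Solaris-style forced unmount, and hares -online brings the VCS mount resource back online:

umount -f /mnt1
hares -online cfsmount1 -sys node01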
