
Tuning VxFS

VERITAS File System (VxFS) provides a set of tuning options and tunable I/O parameters to optimize file system performance for different application workloads. Typically, data streaming applications that access large files see the largest benefit from tuning the file system.

Most of these tuning options have little or no impact on database performance when using Quick I/O, with the exception of the max_thread_proc parameter (see max_thread_proc). Other than setting max_thread_proc, use the general VxFS defaults when creating a VxFS file system for databases. However, you can gather file system performance data when using Quick I/O and use this information to adjust the system configuration to make the most efficient use of system resources.

Monitoring Free Space

In general, VxFS works best if the percentage of free space in the file system is greater than 10 percent. File systems with 10 percent or more free space have less fragmentation and better extent allocation. Monitor free space regularly with the df command. Full file systems can degrade file system performance; remove some files or expand the file system when free space runs low. See the fsadm_vxfs(1M) manual page for a description of online file system expansion.
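
For example, to check the free space on a file system mounted at /db01 (a hypothetical mount point), you can run a command such as:

# df -k /db01

If the reported free space drops below about 10 percent of the file system size, remove files or expand the file system.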

Monitoring Fragmentation

Fragmentation reduces performance and availability. Regular use of fsadm's fragmentation reporting and reorganization facilities is therefore advisable.

The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs using the cron command.
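
For example, a crontab entry such as the following (a sketch, assuming a mount point of /db01 and that fsadm is installed under /opt/VRTS/bin) runs an extent reorganization every Sunday at 2:00 a.m.:

0 2 * * 0 /opt/VRTS/bin/fsadm -F vxfs -e /db01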

Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm or the df -o s commands. There are three factors that can be used to determine the degree of fragmentation:

  • Percentage of free space in extents that are less than eight blocks in length
  • Percentage of free space in extents that are less than 64 blocks in length
  • Percentage of free space in extents that are 64 or more blocks in length

An unfragmented file system will have the following characteristics:

  • Less than 1 percent of free space in extents that are less than eight blocks in length
  • Less than 5 percent of free space in extents that are less than 64 blocks in length
  • More than 5 percent of the total file system size available as free extents that are 64 or more blocks in length

A badly fragmented file system will have one or more of the following characteristics:

  • More than 5 percent of free space in extents that are less than eight blocks in length
  • More than 50 percent of free space in extents that are less than 64 blocks in length
  • Less than 5 percent of the total file system size available as free extents that are 64 or more blocks in length

The optimal period for scheduling extent reorganization runs can be determined by choosing a reasonable interval, scheduling fsadm runs at the initial interval, and running the extent fragmentation report feature of fsadm before and after the reorganization.

The "before" result is the degree of fragmentation prior to the reorganization. If the degree of fragmentation approaches the percentages for bad fragmentation, reduce the interval between fsadm. If the degree of fragmentation is low, increase the interval between fsadm runs.

Tuning VxFS I/O Parameters

VxFS provides a set of tunable I/O parameters that control some of its behavior. These I/O parameters are useful to help the file system adjust to striped or RAID-5 volumes that could yield performance far superior to a single disk. Typically, data streaming applications that access large files see the biggest benefit from tuning the file system.

If VxFS is being used with VERITAS Volume Manager, the file system queries VxVM to determine the geometry of the underlying volume and automatically sets the I/O parameters. VxVM is queried by mkfs when the file system is created to automatically align the file system to the volume geometry. If the default alignment from mkfs is not acceptable, the -o align=n option can be used to override alignment information obtained from VxVM. The mount command also queries VxVM when the file system is mounted and downloads the I/O parameters.

If the default parameters are not acceptable or the file system is being used without VxVM, then the /etc/vx/tunefstab file can be used to set values for I/O parameters. The mount command reads the /etc/vx/tunefstab file and downloads any parameters specified for a file system. The tunefstab file overrides any values obtained from VxVM. While the file system is mounted, any I/O parameters can be changed using the vxtunefs command, which can have tunables specified on the command line or can read them from the /etc/vx/tunefstab file. For more details, see the vxtunefs(1M) and tunefstab(4) manual pages. The vxtunefs command can be used to print the current values of the I/O parameters.
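
For example, the following sketch shows an /etc/vx/tunefstab entry (for a hypothetical volume /dev/vx/dsk/dbdg/db01vol mounted at /db01) that sets read parameters at mount time, followed by vxtunefs commands that change a parameter on the mounted file system and then print the current values; see the vxtunefs(1M) and tunefstab(4) manual pages for the exact syntax on your platform:

/dev/vx/dsk/dbdg/db01vol  read_pref_io=65536,read_nstream=4

# vxtunefs -o write_pref_io=65536 /db01
# vxtunefs /db01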

Tunable VxFS I/O Parameters

read_pref_io

The preferred read request size. The file system uses this parameter in conjunction with the read_nstream value to determine how much data to read ahead. The default value is 64K.

write_pref_io

The preferred write request size. The file system uses this parameter in conjunction with the write_nstream value to determine how to do flush behind on writes. The default value is 64K.

read_nstream

The number of parallel read requests of size read_pref_io that you can have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read ahead size. The default value for read_nstream is 1.

write_nstream

The number of parallel write requests of size write_pref_io that you can have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush behind on writes. The default value for write_nstream is 1.

discovered_direct_iosz

Any file I/O requests larger than the discovered_direct_iosz are handled as discovered direct I/O. A discovered direct I/O is unbuffered similar to direct I/O, but does not require a synchronous commit of the inode when the file is extended or blocks are allocated. For larger I/O requests, the CPU time for copying the data into the page cache and the cost of using memory to buffer the I/O data become more expensive than the cost of doing the disk I/O. For these I/O requests, using discovered direct I/O is more efficient than regular I/O. The default value of this parameter is 256K.

initial_extent_size

Changes the default initial extent size. VxFS determines the size of the first extent to be allocated to the file based on the first write to a new file. Normally, the first extent is the smallest power of 2 that is larger than the size of the first write. If that power of 2 is less than 8K, the first extent allocated is 8K. After the initial extent, the file system increases the size of subsequent extents (see max_seqio_extent_size) with each allocation. Since most applications write to files using a buffer size of 8K or less, the increasing extents start doubling from a small initial extent. initial_extent_size can change the default initial extent size to be larger, so the doubling policy will start from a much larger initial size and the file system will not allocate a set of small extents at the start of the file. Use this parameter only on file systems that will have a very large average file size. On these file systems, it will result in fewer extents per file and less fragmentation. initial_extent_size is measured in file system blocks.

max_buf_data_size

The maximum buffer size allocated for file data; either 8K bytes or 64K bytes. Use the larger value for workloads where large reads/writes are performed sequentially. Use the smaller value on workloads where the I/O is random or is done in small chunks. The default value is 8K bytes.

max_direct_iosz

The maximum size of a direct I/O request that will be issued by the file system. If a larger I/O request comes in, then it is broken up into max_direct_iosz chunks. This parameter defines how much memory an I/O request can lock at once, so it should not be set to more than 20 percent of memory.
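
For example, on a system with 1 GB of physical memory, 20 percent is roughly 200 MB, so max_direct_iosz should be kept well below that value.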

max_diskq

Limits the maximum disk queue generated by a single file. When the file system is flushing data for a file and the number of pages being flushed exceeds max_diskq, processes will block until the amount of data being flushed decreases. Although this does not limit the actual disk queue, it prevents flushing processes from making the system unresponsive. The default value is 1MB.

max_seqio_extent_size

Increases or decreases the maximum size of an extent. When the file system is following its default allocation policy for sequential writes to a file, it allocates an initial extent that is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent) so each extent can hold several writes' worth of data. This is done to reduce the total number of extents in anticipation of continued sequential writes. When the file stops being written, any unused space is freed for other files to use. Normally, this allocation stops increasing the size of extents at 2048 blocks, which prevents one file from holding too much unused space. max_seqio_extent_size is measured in file system blocks.

qio_cache_enable

Enables or disables caching on Quick I/O files. The default behavior is to disable caching. To enable caching, set qio_cache_enable to 1. On systems with large memories, the database cannot always use all of the memory as a cache. By enabling file system caching as a second level cache, performance may be improved. If the database is performing sequential scans of tables, the scans may run faster by enabling file system caching so the file system will perform aggressive read-ahead on the files.
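
For example, to enable caching on Quick I/O files in a file system mounted at /db01 (a hypothetical mount point), you could run a command such as the following; to have the setting take effect at every mount, add qio_cache_enable=1 to the file system's entry in /etc/vx/tunefstab:

# vxtunefs -o qio_cache_enable=1 /db01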

write_throttle

The write_throttle parameter is useful in special situations where a computer system has a combination of a lot of memory and slow storage devices. In this configuration, sync operations (such as fsync()) may take so long to complete that the system appears to hang. This behavior occurs because the file system is creating dirty pages (in-memory updates) faster than they can be asynchronously flushed to disk without slowing system performance.

Lowering the value of write_throttle limits the number of dirty pages per file that a file system will generate before flushing the pages to disk. After the number of dirty pages for a file reaches the write_throttle threshold, the file system starts flushing pages to disk even if free memory is still available. The default value of write_throttle typically generates a lot of dirty pages, but maintains fast user writes. Depending on the speed of the storage device, if you lower write_throttle, user write performance may suffer, but the number of dirty pages is limited, so sync operations will complete much faster.

Because lowering write_throttle can delay write requests (for example, lowering write_throttle may increase the file disk queue to the max_diskq value, delaying user writes until the disk queue decreases), it is recommended that you avoid changing the value of write_throttle unless your system has a large amount of physical memory and slow storage devices.

If the file system is being used with VxVM, it is recommended that you set the VxFS I/O parameters to default values based on the volume geometry.

If the file system is being used with a hardware disk array or volume manager other than VxVM, align the parameters to match the geometry of the logical disk. With striping or RAID-5, it is common to set read_pref_io to the stripe unit size and read_nstream to the number of columns in the stripe. For striping arrays, use the same values for write_pref_io and write_nstream, but for RAID-5 arrays, set write_pref_io to the full stripe size and write_nstream to 1.
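
For example, for a hypothetical logical disk striped across four columns with a 64K stripe unit, commands such as the following (using a hypothetical mount point of /db01) align the file system parameters to that geometry; for a RAID-5 array you would instead set write_pref_io to the full stripe size and write_nstream to 1:

# vxtunefs -o read_pref_io=65536 /db01
# vxtunefs -o read_nstream=4 /db01
# vxtunefs -o write_pref_io=65536 /db01
# vxtunefs -o write_nstream=4 /db01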

For an application to do efficient disk I/O, it should issue read requests that are equal to the product of read_nstream multiplied by read_pref_io. Generally, any multiple or factor of read_nstream multiplied by read_pref_io should be a good size for performance. For writing, the same rule of thumb applies to the write_pref_io and write_nstream parameters. When tuning a file system, the best thing to do is try out the tuning parameters under a real-life workload.
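
For example, with read_pref_io set to 64K and read_nstream set to 4, read requests of 256K (or multiples of 256K) line up well with the configured read ahead.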

If an application is doing sequential I/O to large files, it should issue requests larger than the discovered_direct_iosz. This causes the I/O requests to be performed as discovered direct I/O requests, which are unbuffered like direct I/O but do not require synchronous inode updates when extending the file. If the file is too large to fit in the cache, then using unbuffered I/O avoids throwing useful data out of the cache and lessens CPU overhead.

Obtaining File I/O Statistics using the Quick I/O Interface

The qiostat command provides access to activity information on Quick I/O files on VxFS file systems. The command reports statistics on the activity levels of files from the time the files are first opened using the Quick I/O interface. The accumulated qiostat statistics are reset once the last open reference to the Quick I/O file is closed.

The qiostat command displays the following I/O statistics:

  • Number of read and write operations
  • Number of data blocks (sectors) transferred
  • Average time spent on read and write operations

When Cached Quick I/O is used, qiostat also displays the caching statistics when the -l (long format) option is selected.

The following is an example of qiostat output:


                   OPERATIONS         FILE BLOCKS         AVG TIME(ms)
 FILENAME           READ     WRITE      READ     WRITE      READ     WRITE
 /db01/file1           0         0         0         0       0.0       0.0
 /db01/file2           0         0         0         0       0.0       0.0
 /db01/file3       73017    181735    718528   1114227      26.8      27.9
 /db01/file4       13197     20252    105569    162009      25.8     397.0
 /db01/file5           0         0         0         0       0.0       0.0

For detailed information on available options, see the qiostat(1M) manual page.

Using I/O Statistics Data

Once you gather the file I/O performance data, you can use it to adjust the system configuration to make the most efficient use of system resources. There are three primary statistics to consider:

  • file I/O activity
  • volume I/O activity
  • raw disk I/O activity

If your database is using one file system on a striped volume, you may only need to pay attention to the file I/O activity statistics. If you have more than one file system, you may need to monitor volume I/O activity as well.

First, use the qiostat -r command to clear all existing statistics. After clearing the statistics, let the database run for a while during a typical database workload period. For example, if you are monitoring a database with many users, let the statistics accumulate for a few hours during prime working time before displaying the accumulated I/O statistics.
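
For example, assuming the Quick I/O files reside under /db01 and qiostat is installed under /opt/VRTS/bin, the following clears the accumulated counters for those files:

# /opt/VRTS/bin/qiostat -r /db01/*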

To display active file I/O statistics, use the qiostat command and specify an interval (using -i) for displaying the statistics for a period of time. This command displays a list of statistics such as:


                   OPERATIONS         FILE BLOCKS         AVG TIME(ms)
 FILENAME           READ     WRITE      READ     WRITE      READ     WRITE
 /db01/cust1         218        36       872       144      22.8      55.6
 /db01/hist1           0         1         0         4       0.0      10.0
 /db01/nord1          10        14        40        56      21.0      75.0
 /db01/ord1           19        16        76        64      17.4      56.2
 /db01/ordl1         189        41       756       164      21.1      50.0
 /db01/roll1           0        50         0       200       0.0      49.0
 /db01/stk1         1614       238      6456       952      19.3      46.5
 /db01/sys1            0         0         0         0       0.0       0.0
 /db01/temp1           0         0         0         0       0.0       0.0
 /db01/ware1           3        14        12        56      23.3      44.3
 /logs/log1            0         0         0         0       0.0       0.0
 /logs/log2            0       217         0      2255       0.0       6.8

File I/O statistics help identify files with an unusually large number of operations or excessive read or write times. When this happens, try moving the "hot" files or busy file systems to different disks or changing the layout to balance the I/O load.

Obtaining File I/O Statistics using VERITAS Extension for Oracle Disk Manager

The odmstat command provides access to activity information on Oracle Disk Manager files on VxFS file systems. The command reports statistics on the activity from the time that the files were opened by the Oracle Disk Manager interface. The command has an option for zeroing the statistics. When the file is closed, the statistics are discarded. Refer to the odmstat(1M) manual page for more information.

The odmstat command displays the following I/O statistics:

  • Number of read and write operations
  • Number of data blocks read and written
  • Average time spent on read and write operations

The following is an example of odmstat output:


# /opt/VRTS/bin/odmstat -i 5 /mnt/odmfile*

                     OPERATIONS         FILE BLOCKS         AVG TIME(ms)
 FILE NAME            READ     WRITE      READ     WRITE      READ     WRITE

 Mon May 11 16:21:10 2015
 /db/cust.dbf             0         0         0         0       0.0       0.0
 /db/system.dbf           0         0         0         0       0.0       0.0

 Mon May 11 16:21:15 2015
 /db/cust.dbf           371         0       371         0       0.2       0.0
 /db/system.dbf           0       371         0       371       0.0       5.7

 Mon May 11 16:21:20 2015
 /db/cust.dbf           813         0       813         0       0.3       0.0
 /db/system.dbf           0       813         0       813       0.0       5.5

 Mon May 11 16:21:25 2015
 /db/cust.dbf           816         0       816         0       0.3       0.0
 /db/system.dbf           0       816         0       816       0.0       5.3

 Mon May 11 16:21:30 2015
 /db/cust.dbf             0         0         0         0       0.0       0.0
 /db/system.dbf           0         0         0         0       0.0       0.0

Interpreting I/O Statistics

When running your database through the file system, the read-write lock on each file allows only one active write per file. When you look at the disk statistics using iostat, the disk reports queueing time and service time. The service time is the time that I/O spends on the disk, and the queueing time is how long it waits for all of the other I/Os ahead of it. At the volume level or the file system level, there is no queueing, so vxstat and qiostat do not show queueing time.

For example, if you send 100 I/Os at the same time and each takes 10 milliseconds, the disk reports an average of 10 milliseconds of service time and 490 milliseconds of queueing time. The vxstat, odmstat, and qiostat commands report an average service time of 500 milliseconds.
