Tips and Recommendations for Storage Server Tuning

Here are some tips and recommendations on how to improve the performance of your storage servers. As usual, the optimal settings depend on your particular hardware and usage scenarios, so you should use these settings only as a starting point for your tuning efforts.

Note: Some of the settings suggested here are non-persistent and will be reverted after the next reboot. To keep them permanently, you could add the corresponding commands to /etc/rc.local, use /etc/sysctl.conf or create udev rules to reapply them automatically when the machine boots.
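
As a minimal sketch of the /etc/sysctl.conf approach, the following prints the generic virtual memory values suggested further below in the "key = value" form that sysctl reads. The function name render_sysctl_conf is only illustrative, and the values should be adapted to your own tuning results.

```shell
#!/bin/sh
# Sketch: print sysctl settings in the "key = value" format used by
# /etc/sysctl.conf. The values are the generic ones suggested in the
# "Virtual memory settings" section of this guide.
# render_sysctl_conf is a hypothetical helper name.
render_sysctl_conf() {
    cat <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 262144
EOF
}

# Preview the fragment. To persist and load it (as root), one option is:
#   render_sysctl_conf >> /etc/sysctl.conf && sysctl -p
render_sysctl_conf
```

Settings under /sys/block (IO scheduler, queue sizes) are not covered by sysctl, so they would still need /etc/rc.local or a udev rule.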

Table of Contents

  1. Partition Alignment & RAID Settings of Local File System
  2. Storage Server Throughput Tuning
  3. System BIOS & Power Saving
  4. Concurrency Tuning

Partition Alignment & RAID Settings of Local File System

To get the maximum performance out of your RAID arrays and SSDs, it is important to set the partition offset according to the native alignment. For a walk-through of partition alignment and the creation of a RAID-optimized local file system, see: Partition Alignment Guide

A simple and commonly used way to achieve alignment without having to deal with partition offsets is to skip partitioning entirely and create the file system directly on the device, e.g.:
$ mkfs.xfs /dev/sdX

Storage Server Throughput Tuning

In general, BeeGFS can be used with any of the standard Linux file systems.

Using XFS for your storage server data partition is generally recommended, because it scales very well for RAID arrays and typically delivers a higher sustained write throughput on fast storage than alternative file systems. (ext4 streaming performance has also improved significantly in recent Linux kernel versions.)

However, the default Linux kernel settings are optimized for single-disk scenarios with low IO concurrency, so several settings need to be tuned to get the maximum performance out of your storage servers.

Formatting Options

Make sure to enable the RAID optimizations of the underlying file system, as described in the last section of the guide: Create RAID-optimized File System

While BeeGFS uses dedicated metadata servers to manage global metadata, the metadata performance of the underlying file system on storage servers still matters for operations like file creates, deletes, small reads/writes, etc. Recent versions of XFS (similar work in progress for ext4) allow inlining of data into inodes to avoid the need for additional blocks and the corresponding expensive extra disk seeks for directories. In order to use this efficiently, the inode size should be increased to 512 bytes or larger.

Example: mkfs for XFS with larger inodes on 8 data disks (not counting RAID-5 or RAID-6 parity disks) and 128KB chunk size:
$ mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -i size=512 /dev/sdX
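
The su and sw values in the example above follow from the RAID layout: su is the chunk size and sw is the number of data disks. The following sketch derives them from the RAID level and the total disk count. The helper name xfs_stripe_opts is hypothetical, and only RAID-0, RAID-5 and RAID-6 are handled.

```shell
#!/bin/sh
# xfs_stripe_opts RAID_LEVEL TOTAL_DISKS CHUNK_KB   (hypothetical helper)
# Prints the su/sw values for "mkfs.xfs -d". sw counts data disks only,
# so one disk is subtracted for RAID-5 and two for RAID-6.
xfs_stripe_opts() (
    level=$1
    total=$2
    chunk_kb=$3
    case $level in
        0) parity=0 ;;
        5) parity=1 ;;
        6) parity=2 ;;
        *) echo "unsupported RAID level: $level" >&2; exit 1 ;;
    esac
    data_disks=$((total - parity))
    if [ "$data_disks" -lt 1 ]; then
        echo "too few disks for RAID-$level" >&2
        exit 1
    fi
    echo "su=${chunk_kb}k,sw=${data_disks}"
)

# A 10-disk RAID-6 with 128KB chunks has 8 data disks:
xfs_stripe_opts 6 10 128
```

The output can be passed on directly, e.g. mkfs.xfs -d "$(xfs_stripe_opts 6 10 128)" followed by the remaining options and the device.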

Mount Options

Updating the last file access time is inefficient, because the file system has to write a new timestamp to disk even when the user only read file contents, and even when those contents were already cached in memory and no disk access would otherwise have been necessary.
If your users don't need last access times, you should disable them by adding "noatime" to your mount options. (Note: Recent Linux kernels have switched to a new "relative atime" mode, so setting noatime might not be necessary in these cases.)

Increasing the number of log buffers and their size by adding logbufs and logbsize mount options allows XFS to generally handle and enqueue pending file and directory operations more efficiently.

There are also several mount options for XFS that are intended to further optimize streaming performance on RAID storage, such as largeio, inode64, and swalloc.

If you are using XFS and want to go for optimal streaming write throughput, you might also want to add the mount option allocsize=131072k to reduce the risk of fragmentation for large files. Keep in mind that this setting can significantly increase the interim space usage on systems with many parallel write and create operations.

If your RAID controller has a battery-backup-unit (BBU) or similar technology to protect the cache contents on power loss, adding the mount option nobarrier for XFS or ext4 can significantly increase throughput. Make sure to disable the individual internal caches of the attached disks in the controller settings, as these are not protected by the RAID controller battery.

Example: Typical XFS mount options for a BeeGFS storage server with a RAID controller battery:
$ mount -onoatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
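
Since nobarrier is only appropriate when the controller cache is protected, it can help to build the option string conditionally. This is a small sketch under that assumption; xfs_mount_opts is a hypothetical helper and the mount point in the comment is a placeholder.

```shell
#!/bin/sh
# xfs_mount_opts HAS_BBU   (hypothetical helper)
# HAS_BBU: "yes" if the RAID controller cache is battery-protected.
# nobarrier is appended only in that case.
xfs_mount_opts() (
    opts="noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k"
    if [ "$1" = "yes" ]; then
        opts="$opts,nobarrier"
    fi
    echo "$opts"
)

# Example (as root; device and mount point are placeholders):
#   mount -o "$(xfs_mount_opts yes)" /dev/sdX /data/beegfs_storage
xfs_mount_opts no
```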

IO Scheduler

First, set an appropriate IO scheduler for file servers:
$ echo deadline > /sys/block/sdX/queue/scheduler

Now give the IO scheduler more flexibility by increasing the number of schedulable requests:
$ echo 4096 > /sys/block/sdX/queue/nr_requests

To improve throughput for sequential reads, increase the maximum amount of read-ahead data. The actual amount of read-ahead is adaptive, so using a high value here won't harm performance for small random access.
$ echo 4096 > /sys/block/sdX/queue/read_ahead_kb
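
On servers with several storage devices, the three settings above have to be applied to each device. A minimal sketch of a helper that does this is shown below. The name tune_blockdev and the optional second argument (an alternative sysfs root, useful for dry runs) are assumptions for illustration.

```shell
#!/bin/sh
# tune_blockdev DEVICE [SYSFS_ROOT]   (hypothetical helper)
# Applies the scheduler, nr_requests and read_ahead_kb values from this
# section to one block device. SYSFS_ROOT defaults to /sys.
tune_blockdev() (
    dev=$1
    queue="${2:-/sys}/block/$dev/queue"
    [ -d "$queue" ] || { echo "no such queue directory: $queue" >&2; exit 1; }
    echo deadline > "$queue/scheduler"
    echo 4096 > "$queue/nr_requests"
    echo 4096 > "$queue/read_ahead_kb"
)

# Example (as root; device names are placeholders):
#   for dev in sdb sdc sdd; do tune_blockdev "$dev"; done
```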

Virtual memory settings

To avoid long IO stalls (latencies) for write cache flushing in a production environment with very different workloads, you will typically want to limit the kernel dirty (write) cache size:
$ echo 5 > /proc/sys/vm/dirty_background_ratio
$ echo 10 > /proc/sys/vm/dirty_ratio

Only for special use-cases: If you are going for optimal sustained streaming performance, you may instead want to use different settings that start asynchronous writes of data very early and allow the major part of the RAM to be used for write caching. (For generic use-cases, use the settings described above instead.)
$ echo 1 > /proc/sys/vm/dirty_background_ratio
$ echo 75 > /proc/sys/vm/dirty_ratio
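
To get a feeling for what these percentages mean in absolute terms, the following sketch converts a ratio into megabytes for a given memory size. It is only a rough upper estimate: the kernel applies the ratios to the memory it considers available, not to the total RAM, and dirty_limit_mb is a hypothetical helper name.

```shell
#!/bin/sh
# dirty_limit_mb MEM_KB RATIO   (hypothetical helper)
# Rough upper estimate of the dirty cache limit in MB, assuming the
# ratio applies to MEM_KB kilobytes of memory (integer arithmetic).
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 / 1024 ))
}

# A server with 64GB RAM (67108864 kB):
dirty_limit_mb 67108864 5    # generic dirty_background_ratio
dirty_limit_mb 67108864 10   # generic dirty_ratio
dirty_limit_mb 67108864 75   # streaming-only dirty_ratio
```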

Assigning slightly higher priority to inode caching helps to avoid disk seeks for inode loading:
$ echo 50 > /proc/sys/vm/vfs_cache_pressure

Buffering of file system data requires frequent memory allocation. Raising the amount of reserved kernel memory will enable faster and more reliable memory allocation in critical situations. Raise the corresponding value to 64MB if you have less than 8GB of memory, otherwise raise it to at least 256MB:
$ echo 262144 > /proc/sys/vm/min_free_kbytes
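
The rule of thumb above can be sketched as a small helper that picks the value from the installed memory. The name recommended_min_free_kbytes is hypothetical. Note that MemTotal in /proc/meminfo is usually slightly below the nominal RAM size, so a machine with exactly 8GB installed would land in the lower bracket.

```shell
#!/bin/sh
# recommended_min_free_kbytes MEM_KB   (hypothetical helper)
# Prints 65536 (64MB) for less than 8GB of memory, else 262144 (256MB).
recommended_min_free_kbytes() {
    if [ "$1" -lt 8388608 ]; then
        echo 65536
    else
        echo 262144
    fi
}

# On a live system (as root):
#   mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   recommended_min_free_kbytes "$mem_kb" > /proc/sys/vm/min_free_kbytes
recommended_min_free_kbytes 4194304
```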

Transparent huge pages can cause performance degradation under high load, due to the frequent change of file system cache memory areas. For RHEL 5.x, RHEL 6.x and derivatives, it is recommended to disable the default transparent huge pages support, unless huge pages are explicitly requested by an application through madvise:
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag

For RHEL 7.x and other distributions, it is recommended to have transparent huge pages enabled:

$ echo always > /sys/kernel/mm/transparent_hugepage/enabled
$ echo always > /sys/kernel/mm/transparent_hugepage/defrag
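
To verify which mode is active, you can read the same sysfs file back: the kernel lists all modes and marks the active one with square brackets. The following sketch extracts the bracketed value; thp_active_mode is a hypothetical helper name.

```shell
#!/bin/sh
# thp_active_mode   (hypothetical helper)
# Reads a line such as "always [madvise] never" on stdin and prints the
# mode enclosed in square brackets.
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# On a live system:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled
echo 'always [madvise] never' | thp_active_mode
```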

Controller Settings

Optimal performance for hardware RAID systems often depends on large IOs being sent to the device in a single large operation. Please refer to your hardware storage vendor for the corresponding optimal value of /sys/block/sdX/queue/max_sectors_kb. It is typically good if this value can be increased to at least match your RAID stripe set size (i.e. chunk_size x number_of_disks):
$ echo 1024 > /sys/block/sdX/queue/max_sectors_kb
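
The value 1024 in this example corresponds to 128KB chunks on 8 disks. The sketch below computes the stripe set size and limits it to the hardware maximum, which the kernel reports in /sys/block/sdX/queue/max_hw_sectors_kb and which max_sectors_kb cannot exceed. Both helper names are hypothetical, and the disk count is assumed to mean data disks, as in the mkfs example.

```shell
#!/bin/sh
# stripe_set_kb CHUNK_KB DATA_DISKS   (hypothetical helper)
# Stripe set size in KB: chunk size times number of data disks.
stripe_set_kb() {
    echo $(( $1 * $2 ))
}

# capped_max_sectors_kb WANTED_KB HW_LIMIT_KB   (hypothetical helper)
# Prints WANTED_KB, limited to HW_LIMIT_KB.
capped_max_sectors_kb() {
    if [ "$1" -gt "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# On a live system (as root):
#   hw=$(cat /sys/block/sdX/queue/max_hw_sectors_kb)
#   capped_max_sectors_kb "$(stripe_set_kb 128 8)" "$hw" \
#       > /sys/block/sdX/queue/max_sectors_kb
stripe_set_kb 128 8
```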

Furthermore, high values of sg_tablesize (/sys/class/scsi_host/.../sg_tablesize) are recommended to allow large IOs. Those values depend on controller firmware versions, kernel versions and driver settings.

System BIOS & Power Saving

To allow the Linux kernel to correctly detect the system properties and enable corresponding optimizations (e.g. for NUMA systems), it is very important to keep your system BIOS updated.

The dynamic CPU clock frequency scaling feature for power saving, which is typically enabled by default, has a high impact on latency. Thus, it is recommended to turn off dynamic CPU frequency scaling. Ideally, this is done in the machine BIOS, where you will often find a general setting like "Optimize for performance".

If frequency scaling is not disabled in the machine BIOS, recent Intel CPU generations require the parameter "intel_pstate=disable" to be added to the kernel boot command line, which is typically defined in the grub boot loader configuration file. After changing this setting, the machine needs to be rebooted.

If the Intel pstate driver is disabled or not applicable to a system, frequency scaling can be changed at runtime, e.g. via:
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null

You can check if CPU frequency scaling is disabled by using the following command on an idle system. If it is disabled, you should see the full clock frequency in the "cpu MHz" line for all CPU cores and not a reduced value.
$ cat /proc/cpuinfo
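
On machines with many cores, the full /proc/cpuinfo output is long. As a rough sketch, the following summarizes the "cpu MHz" lines into a minimum and maximum, so a reduced clock on any core is visible at a glance. The helper name cpu_mhz_range is hypothetical, and it prints nothing on platforms whose /proc/cpuinfo has no "cpu MHz" lines.

```shell
#!/bin/sh
# cpu_mhz_range   (hypothetical helper)
# Reads /proc/cpuinfo-style text on stdin and prints the lowest and
# highest "cpu MHz" value plus the number of cores seen.
cpu_mhz_range() {
    awk -F: '
        /^cpu MHz/ {
            v = $2 + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            n++
        }
        END {
            if (n) printf "min=%.0f max=%.0f cores=%d\n", min, max, n
        }'
}

# On a live, idle system:
#   cpu_mhz_range < /proc/cpuinfo
```

If min is clearly below max, or both are below the nominal clock frequency of the CPU, frequency scaling is most likely still active.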

Concurrency Tuning

Worker Threads

Storage servers, metadata servers and clients allow you to control the number of worker threads by setting the value of tuneNumWorkers (in /etc/beegfs/beegfs-X.conf). In general, a higher number of workers allows for more parallelism (e.g. a server will work on more client requests in parallel). But a higher number of workers also results in more concurrent disk access, so especially on the storage servers, the ideal number of workers may depend on the number of disks that you are using.
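
Before changing tuneNumWorkers, it can be useful to check the value that is currently configured. The sketch below reads a key from a config file, assuming the file uses plain "key = value" lines. The helper name conf_value is hypothetical, and the file path is passed in by the caller.

```shell
#!/bin/sh
# conf_value FILE KEY   (hypothetical helper)
# Prints the value of KEY from a "key = value" style config file.
# Comment lines do not match; if the key occurs more than once, the
# last occurrence wins.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Example (replace X with the service name, as in the path above):
#   conf_value /etc/beegfs/beegfs-X.conf tuneNumWorkers
```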
